We compare and contrast long-term file system activity for different Unix environments for periods of 120 to 280 days. Our focus is on finding common long-term activity trends and reference patterns. Our analysis shows that 90% of all files are not used after initial creation, that those which are used are normally short-lived, and that if a file is not used in some manner the day after it is created, it will probably never be used. Additionally, we find that approximately 1% of all files are used daily. This information allows us to more accurately predict which files will never be used. These files can be compressed or moved to tertiary storage, enabling either more users per disk or larger user disk quotas.
Massively parallel file systems must provide high-bandwidth file access to programs running on their machines. Most accomplish this goal by striping files across arrays of disks attached to a few specialized I/O nodes in the massively parallel processor (MPP). This arrangement requires programmers to give the file system many hints on how their data is to be laid out on disk if they want to achieve good performance. Additionally, the custom interface makes massively parallel file systems hard for programmers to use and difficult to integrate seamlessly into an environment with workstations and tertiary storage. The RAMA file system addresses these problems by providing a massively parallel file system that does not need user hints to provide good performance. RAMA takes advantage of the recent decrease in physical disk size by assuming that each processor in an MPP has one or more disks attached to it. Hashing is then used to pseudo-randomly distribute data to all of these disks, ensuring high bandwidth regardless of access pattern. Since MPP programs often have many nodes accessing a single file in parallel, the file system must allow access to different parts of the file without relying on a particular node. In RAMA, a file request involves only two nodes: the node making the request and the node on whose disk the data is stored. Thus, RAMA scales well to hundreds of processors. Since RAMA needs no layout hints from applications, it fits well into systems where users cannot (or will not) provide such hints. Fortunately, this flexibility does not cause a large loss of performance. RAMA's simulated performance is within 10–15% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.
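To make the placement idea above concrete, here is a minimal sketch of hash-based block placement, with illustrative names rather than RAMA's actual code: each (file, block) pair is hashed to choose the disk that holds it, so data spreads pseudo-randomly across all disks without layout hints, and any node can compute a block's location independently.

```python
# Hypothetical sketch of hash-based block placement; function names and the
# choice of SHA-256 are assumptions for illustration, not RAMA's design details.
import hashlib

def place_block(file_id: int, block_num: int, num_disks: int) -> int:
    """Return the index of the disk responsible for this block of this file."""
    key = f"{file_id}:{block_num}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_disks

# Only two nodes are involved in a request: the requester and the owner of the
# disk chosen here. A quick tally shows blocks spreading evenly over 64 disks.
counts = {}
for block in range(10_000):
    disk = place_block(file_id=42, block_num=block, num_disks=64)
    counts[disk] = counts.get(disk, 0) + 1
print(min(counts.values()), max(counts.values()))
```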
The supercomputer center at the National Center for Atmospheric Research (NCAR) migrates large numbers of files to and from its mass storage system (MSS) because there is insufficient space to store them on the Cray supercomputer's local disks. This paper presents an analysis of file migration data collected over two years. The analysis shows that requests to the MSS are periodic, with one-day and one-week periods. Read requests to the MSS account for the majority of the periodicity, while write requests are relatively constant over the course of a week. Additionally, reads show a far greater fluctuation than writes over a day and a week, since reads are driven by human users while writes are machine-driven.
In this paper we present the analysis of two large-scale network file system workloads. We measured CIFS traffic for two enterprise-class file servers deployed in the NetApp data center over a three-month period. One file server was used by the marketing, sales, and finance departments and the other by the engineering department. Together these systems represent over 22 TB of storage used by over 1500 employees, making this the first large-scale study of the CIFS protocol. We analyzed how our network file system workloads compared to those of previous file system trace studies and took an in-depth look at access, usage, and sharing patterns. We found that our workloads were quite different from those previously studied; for example, our analysis found increased read-write file access patterns, decreased read-write ratios, more random file access, and longer file lifetimes. In addition, we found a number of interesting properties regarding file sharing, file re-use, and the access patterns of file types and users, showing that modern file system workloads have changed in the past 5–10 years. This change in workload characteristics has implications for the future design of network file systems, which we describe in the paper.
Users are storing ever-increasing amounts of information digitally, driven by many factors including government regulations and the public's desire to digitally record their personal histories. Unfortunately, many of the security mechanisms that modern systems rely upon, such as encryption, are poorly suited for storing data for indefinitely long periods of time: it is very difficult to manage keys and update cryptosystems to provide secrecy through encryption over periods of decades. Worse, an adversary who can compromise an archive need only wait for cryptanalysis techniques to catch up to the encryption algorithm used at the time of the compromise in order to obtain "secure" data. To address these concerns, we have developed POTSHARDS, an archival storage system that provides long-term security for data with very long lifetimes without using encryption. Secrecy is achieved by using provably secure secret splitting and spreading the resulting shares across separately-managed archives. Providing availability and data recovery in such a system can be difficult; thus, we use a new technique, approximate pointers, in conjunction with secure distributed RAID techniques to provide availability and reliability across independent archives. To validate our design, we developed a prototype POTSHARDS implementation, which has demonstrated "normal" storage and retrieval of user data using indexes, the recovery of user data using only the pieces a user has stored across the archives, and the reconstruction of an entire failed archive.
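As a minimal illustration of the secret-splitting idea (not POTSHARDS' actual scheme, and with no attempt to model approximate pointers or the distributed RAID layer), the sketch below splits data into n shares by XOR so that all n shares are needed to reconstruct it and any smaller subset reveals nothing.

```python
# Hypothetical n-of-n XOR secret splitting; a simplification used only to
# illustrate secrecy without encryption. POTSHARDS' actual splitting and
# share placement are more sophisticated.
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret: bytes, n: int) -> list[bytes]:
    """Split `secret` into n shares; every share is required to reconstruct."""
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    last = secret
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def combine(shares: list[bytes]) -> bytes:
    out = shares[0]
    for s in shares[1:]:
        out = xor_bytes(out, s)
    return out

record = b"archival record"
pieces = split(record, 4)           # spread across separately managed archives
assert combine(pieces) == record    # fewer than 4 shares reveal nothing
```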
With the growing use of large-scale distributed systems, the likelihood that at least one node is compromised is increasing. Large-scale systems that process sensitive data such as geographic data with defense implications, drug modeling, nuclear explosion modeling, and private genomic data would benefit greatly from strong security for their storage. Nevertheless, many high performance computing (HPC), cloud, or secure content delivery network (SCDN) systems that handle such data still store them unencrypted or use simple encryption schemes, relying heavily on physical isolation to ensure confidentiality, providing little protection against compromised computers or malicious insiders. Moreover, current encryption solutions cannot efficiently provide fine-grained encryption for large datasets. Our approach, Horus, encrypts large datasets using keyed hash trees (KHTs) to generate different keys for each region of the dataset, providing fine-grained security: the key for one region cannot be used to access another region. Horus also reduces key management and distribution overhead while providing end-to-end data encryption and reducing the need to trust system operators or cloud service providers. Horus requires little modification to existing systems and user applications. Performance evaluation shows that our prototype's key distribution is highly scalable and robust: a single key server can provide 140,000 keys per second, theoretically enough to sustain more than 100 GB/s of I/O throughput, and multiple key servers can efficiently operate in parallel to support load balancing and reliability.
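A keyed hash tree can be pictured as deriving each region's key from its parent's key with a keyed hash, so holding the key of a subtree grants access only to the regions under that subtree. The sketch below is a hypothetical binary-tree version; the function names, tree shape, and use of HMAC-SHA-256 are illustrative assumptions rather than Horus's actual interface.

```python
# Hypothetical keyed-hash-tree (KHT) key derivation over a binary tree of
# dataset regions; all details here are assumptions for illustration.
import hmac
import hashlib

def child_key(parent_key: bytes, child_index: int) -> bytes:
    """Derive a child's key from its parent's key with a keyed hash."""
    return hmac.new(parent_key, child_index.to_bytes(4, "big"),
                    hashlib.sha256).digest()

def region_key(root_key: bytes, region: int, depth: int) -> bytes:
    """Walk from the root to the leaf covering `region` in a tree of 2**depth leaves."""
    key = root_key
    for level in range(depth - 1, -1, -1):
        key = child_key(key, (region >> level) & 1)
    return key

root = b"\x00" * 32                  # in practice, held only by the key server
k5 = region_key(root, region=5, depth=4)
# A client given an interior node's key can derive keys for everything beneath
# it, but cannot derive keys for any sibling subtree.
```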
Galois Field arithmetic forms the basis of Reed-Solomon and other erasure coding techniques to protect storage systems from failures. Most implementations of Galois Field arithmetic rely on multiplication tables or discrete logarithms to perform this operation. However, the advent of 128-bit instructions, such as Intel's Streaming SIMD Extensions, allows us to perform Galois Field arithmetic much faster. This short paper details how to leverage these instructions for various field sizes, and demonstrates the significant performance improvements on commodity microprocessors. The techniques that we describe are available as open source software.
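For context, the baseline that such SIMD techniques accelerate is table-driven multiplication. The sketch below implements GF(2^8) multiplication with log/antilog tables (a standard construction, not the paper's SIMD code), using the common Reed-Solomon field polynomial.

```python
# Table-based multiplication in GF(2^8) with the usual Reed-Solomon field
# polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D); 2 is a primitive element.
GF_POLY = 0x11D
EXP = [0] * 512      # EXP[i] = 2^i, doubled in length to avoid a modulo
LOG = [0] * 256      # LOG[x] = discrete log of x, base 2

x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= GF_POLY
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

assert gf_mul(2, EXP[254]) == 1      # an element times its inverse is 1
assert gf_mul(0x53, 1) == 0x53       # 1 is the multiplicative identity
```

The paper's contribution is to replace these scalar, per-byte table lookups with 128-bit instructions that operate on many bytes at once.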
The scale of today's storage systems has made it increasingly difficult to find and manage files. To address this, we have developed Spyglass, a file metadata search system that is specially designed for large-scale storage systems. Using an optimized design, guided by an analysis of real-world metadata traces and a user study, Spyglass allows fast, complex searches over file metadata to help users and administrators better understand and manage their files. Spyglass achieves fast, scalable performance through the use of several novel metadata search techniques that exploit metadata search properties. Flexible index control is provided by an index partitioning mechanism that leverages namespace locality. Signature files are used to significantly reduce a query's search space, improving performance and scalability. Snapshot-based metadata collection allows incremental crawling of only modified files. A novel index versioning mechanism provides both fast index updates and "back-in-time" search of metadata. An evaluation of our Spyglass prototype using our real-world, large-scale metadata traces shows search performance that is 1–4 orders of magnitude faster than existing solutions. The Spyglass index can quickly be updated and typically requires less than 0.1% of disk space. Additionally, metadata collection is up to 10× faster than existing approaches.
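The signature-file idea can be sketched as a small Bloom-filter-style summary kept per index partition: a query probes the signature first and skips the partition entirely on a miss. The class below is a simplified illustration, not Spyglass's actual on-disk structure.

```python
# Hypothetical Bloom-filter-style partition signature; sizes and hashing are
# illustrative assumptions, not Spyglass's actual format.
import hashlib

class PartitionSignature:
    def __init__(self, bits: int = 1 << 16, hashes: int = 3):
        self.bits = bits
        self.hashes = hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, value: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bitmap[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bitmap[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))

sig = PartitionSignature()
sig.add("owner:alice")
assert sig.might_contain("owner:alice")   # possible match: search this partition
# A False result is definitive: the partition cannot match and is skipped.
```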
As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, low-power, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequately fulfills all of these requirements. Tape-based archival systems suffer from poor random access performance, which prevents the use of inter-media redundancy techniques and auditing, and requires the preservation of legacy hardware. Many disk-based systems are ill-suited for long-term storage because their high energy demands and management requirements make them cost-ineffective for archival purposes. Our solution, Pergamum, is a distributed network of intelligent, disk-based storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests, and inter-disk data verification to be performed while the disk is powered off. Pergamum uses both intra-disk and inter-disk redundancy to guard against data loss, relying on hash tree-like structures of algebraic signatures to efficiently verify the correctness of stored data. If failures occur, Pergamum uses staggered rebuild to reduce peak energy usage while rebuilding large redundancy stripes. We show that our approach is comparable in both startup and ongoing costs to other archival technologies and provides very high reliability. An evaluation of our implementation of Pergamum shows that it provides adequate performance.
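As a simplified stand-in for the verification structure described above, the sketch below builds a hash tree over per-block digests and compares roots; the real system uses algebraic signatures, which have properties (such as composing with parity) that plain cryptographic hashes do not.

```python
# Hash-tree verification sketch; a simplified substitute for Pergamum's
# algebraic-signature trees, used only to illustrate low-cost auditing.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def tree_root(digests: list[bytes]) -> bytes:
    """Combine per-block digests pairwise up to a single root."""
    level = list(digests)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])        # duplicate the odd block out
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Small per-block digests and the root can live in NVRAM, so nodes can audit
# each other by exchanging digests while the disks stay powered off.
blocks = [b"block-%d" % i for i in range(8)]
stored_root = tree_root([h(b) for b in blocks])
assert tree_root([h(b) for b in blocks]) == stored_root   # data unchanged
```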
We have developed a scheme to secure network-attached storage systems against many types of attacks. Our system uses strong cryptography to hide data from unauthorized users; someone gaining complete access to a disk cannot obtain any useful data from the system, and backups can be done without allowing the super-user access to cleartext. While insider denial-of-service attacks cannot be prevented (an insider can physically destroy the storage devices), our system detects attempts to forge data. The system was developed using a raw disk, and can be integrated into common file systems. All of this security can be achieved with little penalty to performance. Our experiments show that, using a relatively inexpensive commodity CPU attached to a disk, our system can store and retrieve data with virtually no penalty for random disk requests and only a 15–20% performance loss over raw transfer rates for sequential disk requests. With such a minor performance penalty, there is no longer any reason not to include strong encryption and authentication in network file systems.
Stored data needs to be protected against device failure and irrecoverable sector read errors, yet doing so at exabyte scale can be challenging given the large number of failures that must be handled. We have developed RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, an approach to storage system redundancy that uses only XOR-based parity and employs a graph to lay out data and parity. The RESAR layout offers greater robustness and higher flexibility for repair at the same overhead as a declustered version of RAID 6. For instance, a RESAR-based layout with 16 data disklets per stripe has about 50 times lower probability of suffering data loss in the presence of a fixed number of failures than a corresponding RAID 6 organization. RESAR uses a layer of virtual storage elements to achieve better manageability, a broader potential for energy savings, and easier adoption of heterogeneous storage devices.
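Since RESAR relies only on XOR-based parity, a single lost block in a stripe can be rebuilt by XORing the survivors with the parity. The sketch below shows that repair step in isolation; RESAR's actual contribution, the graph that decides which blocks share which parity elements, is not modeled here.

```python
# XOR parity over one stripe and single-block repair; the stripe width of 16
# data disklets matches the example in the abstract, everything else is an
# illustrative assumption.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripe = [bytes([i]) * 4096 for i in range(16)]   # 16 data disklets
parity = xor_blocks(stripe)                       # XOR of all data blocks

lost = 7
survivors = [b for i, b in enumerate(stripe) if i != lost]
recovered = xor_blocks(survivors + [parity])
assert recovered == stripe[lost]
```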
Data used in high-performance computing (HPC) applications is often sensitive, necessitating protection against both physical compromise of the storage media and "rogue" computation nodes. Existing approaches to security may require trusting storage nodes and are vulnerable to a single computation node gathering keys that can unlock all of the data used in the entire computation. Our approach, Horus, encrypts petabyte-scale files using a keyed hash tree to generate different keys for each region of the file, supporting much finer-grained security. A client can only access a file region for which it has a key, and the tree structure allows keys to be generated for large and small regions as needed. Horus can be integrated into a file system or layered between applications and existing file systems, simplifying deployment. Keys can be distributed in several ways, including the use of a small stateless key cluster that strongly limits the size of the system that must be secured against attack. The system poses no added demand on the metadata cluster or the storage devices, and little added demand on the clients beyond the unavoidable need to encrypt and decrypt data, making it highly suitable for protecting data in HPC systems.