CMU CSD PhD Blog - Mimir: Finding cost-efficient storage configurations in the public cloud

In today’s landscape of diverse public cloud providers like AWS, Microsoft Azure, and Google Cloud Platform, organizations are increasingly turning to cloud computing with pay-as-you-go pricing models. Many businesses are adopting public cloud services to simplify data center management or leverage the scalability and elasticity offered by these providers.

A pressing question that accompanies this shift to public cloud adoption is how to optimize the overall cost of utilizing cloud resources. While researchers have recently delved into cost optimization for virtual machine (VM) instances used in computational workloads, there has been limited focus on optimizing storage choices. Companies frequently require high-performance storage clusters to operate their workloads efficiently in public clouds. The costs of these storage clusters should not be underestimated, since both the VMs and the block storage volumes they use can strain budgets.

Thus, companies need to carefully select resources for storage clusters to reduce their Total Cost of Ownership. If organizations opt for only inexpensive resources to minimize costs, their storage clusters may fail to meet performance requirements. Conversely, selecting solely high-performance Virtual Machines and storage types can lead to substantial spending compared to an optimized resource selection approach.

Nonetheless, choosing a cost-efficient set of resources for storage clusters in public clouds remains a challenging task, and no existing system helps with this provisioning decision. The multitude of available VM and storage types adds complexity. For instance, AWS alone offers over a hundred different instance types and various block storage options, including locally attached (LocalSSD) and remotely disaggregated (EBS). Each resource option comes with distinct pricing and performance models, and the performance also varies based on workload characteristics. This necessitates a deep understanding of both cloud resource attributes and workload characteristics to make informed selections. If we factor in the potential use of heterogeneous storage cluster configurations, the problem’s search space becomes significantly larger and more intricate.

To address these challenges, we introduce Mimir, a resource auto-selection tool designed to identify the most cost-efficient set of resources for storage clusters in public clouds, all while meeting specified performance requirements. Our system assesses all available VM types, block storage options, and even combinations of these options (heterogeneous configurations) to determine the optimal solution. As a result, Mimir can yield storage cluster configurations up to 81% cheaper than those generated by the current state-of-the-art resource auto-selection tools. In our evaluations, we demonstrate that Mimir can also serve as a resource selector for mixed workloads (comprising multiple workloads with distinct characteristics) and dynamic workloads, efficiently identifying cost-effective cluster configurations within a reasonable time.

Challenges: navigating diverse resource options and heterogeneity

Challenge 1: diverse storage options’ characteristics

Public cloud providers have established unique performance characteristics for their block storage options, setting them apart from traditional solutions like SSDs and HDDs. Workload attributes such as access pattern (random/sequential), read ratio, and I/O unit size can exert significant influence on the performance of cloud block storage. Overlooking these factors or assuming that cloud storage behaves analogously to traditional storage can result in erroneous storage cluster configurations. To illustrate this, we present two examples showcasing how workload characteristics impact storage performance.



Fig. 1: Performance characteristics of public cloud storage volume types by (a) I/O unit size and (b) workload read ratio. In (a), both volume types have throughput limits defined by AWS (horizontal lines).

In our tests, we used the fio benchmark to assess cloud block storage performance on AWS with three different storage types: local NVMe SSD, remote SSD (gp2), and remote HDD (st1). We varied access patterns, read ratios, and I/O unit sizes. Fig. 1 shows the characteristics of a 1 TiB gp2 volume and a 1 TiB st1 volume, which the performance model provided by AWS rates at 3000 IOPS and 40 MiB/s respectively, along with the local SSD attached to an i3.xlarge instance.

Fig. 1(a) shows how performance characteristics vary with I/O unit size and access pattern for each storage type. For gp2, whose performance is defined in IOPS, increasing the I/O unit size results in higher throughput until it reaches the maximum limit set by AWS, and this behavior is the same regardless of the access pattern. In contrast, st1’s performance is defined in MiB/s, so it should ideally sustain 40 MiB/s regardless of the I/O unit size. However, unlike with sequential accesses, it delivers reduced throughput for workloads with random access patterns and I/O units smaller than 1 MiB.
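The IOPS-defined behavior described above can be written down in a few lines. This is only a minimal sketch of the idealized model, not Mimir's code: the function name is ours, and the 250 MiB/s cap in the usage lines is a placeholder assumption, since the text above only says that AWS sets a maximum.

```python
def iops_bound_throughput_mib_s(iops, io_size_kib, cap_mib_s):
    """Idealized throughput of a volume whose performance is defined in IOPS.

    Throughput grows linearly with the I/O unit size until it reaches the
    provider-defined cap. The cap is passed in by the caller because its
    actual value is an assumption here.
    """
    uncapped = iops * io_size_kib / 1024  # KiB/s converted to MiB/s
    return min(uncapped, cap_mib_s)


# 3000 IOPS matches the 1 TiB gp2 volume in the text.
# The cap of 250 MiB/s is a hypothetical value for illustration.
small_io = iops_bound_throughput_mib_s(3000, 16, 250)
large_io = iops_bound_throughput_mib_s(3000, 256, 250)
print(small_io, large_io)
```

A throughput-defined volume such as st1 would ideally be a constant in this model. The measured drop for small random I/O is exactly the kind of behavior that such a simple formula misses.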

In Fig. 1(b), we examine the impact of read ratios on each volume type’s throughput. EBS volumes remain unaffected by the read ratio, as it lies outside their performance models. Conversely, local SSD exhibits considerably higher throughput than EBS and is notably influenced by the read ratio.

As highlighted above, storage performance characteristics in public clouds differ from those of traditional storage. For instance, remote SSD throughput remains consistent regardless of the read-to-write ratio, while the throughput of each storage option responds differently to changes in I/O unit size. This can confuse users configuring cloud storage clusters, as they may erroneously assume that cloud storage exhibits conventional storage behavior. However, by accurately considering these pricing and performance models, Mimir can mathematically deduce performance specifications from allocated resources, aiding in cost-efficient cloud storage configurations that meet performance needs.

It is worth noting that local SSD having the highest throughput in Fig. 1 does not always make it the best choice. The throughput of each storage option varies with its configuration; larger gp2 and st1 volumes can outperform local SSD. st1 and gp2 also come with lower per-byte costs, making them cost-efficient alternatives when high throughput is not crucial.

Challenge 2: heterogeneity is important for cost-efficiency

One easy way of selecting resources for a storage cluster in the public cloud is configuring a homogeneous storage cluster by using a single storage option. However, we found that there is no single storage option that is the most cost-efficient for every workload, and sometimes, even a mix of storage options is needed to minimize the cost.


Fig. 2: No volume type is most cost-efficient for every workload, and a mix of volume types may be the most cost-effective option.

Fig. 2 demonstrates the need to consider various volume types and configurations for selecting a cost-efficient Virtual Storage Cluster (VSC) configuration. For each of the three workloads, it shows the cost for the best VSC configuration under three constraints: using only local SSD volume types, only remote storage (EBS) volume types, and arbitrary mixes of both.

For Workload 1, which demands high storage throughput per GiB of data, opting for EBS volume types leads to over-provisioning capacity, making it an expensive choice due to the limit of 3 IOPS per provisioned GiB. Conversely, for Workload 2, which has lower storage throughput requirements, local SSD is an expensive option because it over-provisions storage performance. Workload 3 combines varying performance needs, necessitating a mix of storage options to minimize costs.
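To make the over-provisioning in Workload 1 concrete, here is a small sketch of how much gp2 capacity a workload has to provision when the IOPS requirement, rather than the data size, is the binding constraint. The function name and the example numbers are hypothetical, and the sketch deliberately ignores burst credits and per-volume minimums and maximums.

```python
import math


def gp2_provisioned_gib(data_gib, required_iops, iops_per_gib=3):
    """Capacity to provision so that both data size and IOPS needs are met.

    Assumes performance scales at a fixed rate per provisioned GiB, as in
    the 3 IOPS per provisioned GiB figure quoted in the text.
    """
    gib_for_iops = math.ceil(required_iops / iops_per_gib)
    return max(data_gib, gib_for_iops)


# Hypothetical throughput-heavy workload: 100 GiB of data, 3000 IOPS.
needed = gp2_provisioned_gib(100, 3000)
print(needed, needed / 100)  # provisioned GiB and over-provisioning factor
```

In this made-up case the volume must be ten times larger than the data it holds, which is the situation where local SSD becomes the cheaper option.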

Therefore, it is crucial to consider heterogeneous VSC configurations for cost-efficiency. However, heterogeneity enlarges the search space so much that exploring it with naive methods is impractical. Mimir navigates this complex search space efficiently by using dynamic programming and integer-linear programming.

Mimir: resource auto-selector for storage cluster in public clouds

To tackle these challenges, we introduce Mimir, a resource auto-selector that identifies the cost-efficient set of VMs and storage volumes for a storage cluster. Mimir takes into account workload characteristics (such as read/write request ratio and data access locality) and user-defined requirements (including request rate and capacity). Next, we will provide an overview of Mimir’s workflow and delve into our main optimization algorithm.

Mimir design and workflow

Fig. 3 outlines Mimir’s workflow, which begins by inputting characteristics from multiple workloads requiring cluster storage. Each storage cluster’s workload profiler profiles these attributes, and the Resource Profiler assesses them to determine resource needs for cost-effective cloud operations. This involves resource utilization profiling using micro-benchmarks, considering given data access patterns like request rate, access locality, and read/write ratios. The Resource Predictor uses this resource profiling data to identify efficient container sizes (i.e., storage/network bandwidth, CPU count, memory) for each workload, as Mimir utilizes containers to run multiple storage servers in the same VM with resource isolation. Finally, the VSC Cost Optimizer combines these insights with the public cloud’s cost model to optimize the Virtual Storage Cluster (VSC) configuration for the distributed storage system.

Fig. 3: Mimir’s workflow for optimizing the price of public cloud resources. Initially, Mimir profiles the provided workloads, learning the precise resource requirements (such as CPU and memory). Using this trained module and a cost model encompassing public cloud resources, the VSC Cost Optimizer then identifies the most cost-efficient Virtual Storage Cluster (VSC) configuration.

Mimir assumes that users provide or profile the workload characteristics, which the system uses as input for its optimization process. This modular approach makes Mimir adaptable to any storage system capable of profiling sufficient workload information. Next, we provide a brief overview of the optimization algorithm used by Mimir to minimize costs for the given workload characteristics. Further details regarding other components, such as the Resource Profiler and Resource Predictor, can be found in our paper.

Optimization algorithm: dynamic programming

The VSC Cost Optimizer addresses the following question: What resource configuration minimizes costs while meeting performance requirements and accommodating storage workload characteristics?

In this optimization problem, we identified an optimal substructure property. This means that if Mimir determines the most cost-efficient virtual storage cluster configuration for the given workloads, then any subset of storage servers from the entire cluster (a sub-cluster) must also represent the cost-efficient configuration for the workloads running on that specific sub-cluster.

Fig. 4: Mimir’s optimization problem has an optimal substructure property. If we find the most cost-efficient configuration for the entire virtual storage cluster, then any sub-cluster of that configuration must also be cost-efficient for the portion of data stored within that sub-cluster.

Fig. 4 illustrates the optimal substructure property. Suppose Machines 1-4 represent the most cost-efficient VSC configuration for a given workload. We contend that any sub-cluster should also be the most cost-efficient for the portion of the workload it handles. To prove this, we use a proof by contradiction. Let’s assume that Machines 1 and 2 are not the most cost-efficient sub-cluster configuration for 3/10 of the workload. This would imply the existence of another sub-cluster (in this case, Machine 5) that’s cheaper than Machines 1 and 2. Replacing Machines 1 and 2 with Machine 5 would then give a cheaper configuration for the entire workload, Machines 3-5 (total VSC cost: $26). This contradicts the assumption that Machines 1-4 (total VSC cost: $28) constitute the most cost-efficient VSC configuration.

Based on the optimal substructure property, we use dynamic programming to break down a large search space into manageable segments. For a more in-depth understanding of our approach, including how we use mixed-integer programming for the base case and how Mimir integrates other components (e.g., resource profiler, resource predictor, and cost model) into its optimization algorithm, please refer to our paper.
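To give a feel for how optimal substructure turns into dynamic programming, here is a deliberately simplified sketch. It is not Mimir's optimizer: it assumes the workload can be cut into equal units and that each machine option serves a fixed number of units at a fixed cost, whereas Mimir derives such base cases from profiling and mixed-integer programming. All names and numbers are hypothetical.

```python
def cheapest_cluster(total_units, machine_options):
    """Return (min_cost, machines) for serving total_units of workload.

    machine_options is a list of (name, units_served, cost) tuples.
    best[n] holds the cheapest cost for n units. By optimal substructure,
    the cheapest cluster for n units is one machine plus the cheapest
    cluster for the units that machine does not cover.
    """
    inf = float("inf")
    best = [0.0] + [inf] * total_units
    choice = [None] * (total_units + 1)
    for n in range(1, total_units + 1):
        for name, units, cost in machine_options:
            rest = max(0, n - units)  # units left for the sub-cluster
            if best[rest] + cost < best[n]:
                best[n] = best[rest] + cost
                choice[n] = (name, rest)
    # Walk the recorded choices back to recover the selected machines.
    machines, n = [], total_units
    while n > 0:
        name, n = choice[n]
        machines.append(name)
    return best[total_units], machines


# Hypothetical machine options: (name, workload units served, cost).
options = [("small", 1, 4.0), ("medium", 3, 10.0), ("large", 7, 20.0)]
cost, machines = cheapest_cluster(10, options)
print(cost, sorted(machines))
```

Each entry of the table is computed once and reused, so the search never re-evaluates the same sub-cluster, which is the point of exploiting optimal substructure.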

Mimir finds storage clusters up to 81% cheaper than the state of the art

We evaluated Mimir using Apache BookKeeper as the distributed storage backend and six of Meta’s RocksDB (MR) key-value workloads. The results demonstrate significant cost savings achieved by Mimir compared to state-of-the-art solutions, owing to its ability to consider a wide range of volume types. We compared Mimir to three baseline configurations, each restricted to a limited subset of instance or storage types, in contrast to Mimir’s comprehensive consideration of all instance and block storage types:

i3.xlarge-only: The simplest way to configure a VSC is using a single instance type (storage-optimized instance, i3.xlarge) and determining the number of instances based on the storage server performance.
Mimir-LocalOnly: Another way is to use only instance types that have local SSDs, including some compute- or memory-optimized instance types like m5d, c5d, and r5d.
Mimir-EBSonly/OptimusCloud-like: Yet another way of configuring a VSC is to use only EBS volumes, which persist data independently of the instance status. If the workload requires high performance, however, this can be more expensive than local SSD. OptimusCloud, the previous work we consider the state of the art, restricts the volume type to EBS volumes because of their persistent nature, but our results show that this approach is often much more costly.

Fig. 5: Cost-efficiency analysis of the optimization results for Meta’s RocksDB key-value workloads. Throughput-intensive workloads (MR-A,C,E,F) prefer local SSD as their storage type. In contrast, the other workloads (MR-B,D), which do not require high throughput, prefer EBS volumes to local SSD.

Fig. 5 shows the VSC costs of the most cost-efficient VSC configurations under different resource constraints. Overall, Mimir successfully identifies the most cost-efficient VSC configuration for any workload in Fig. 5, and achieves up to 81% cost savings compared to the OptimusCloud-like baseline. We also demonstrate that depending on the workload characteristics, different workloads prefer different storage types to store data cost-efficiently.

For instance, MR-D, a capacity-intensive workload, does not require high performance. Thus, local SSD proves costly as it under-utilizes its storage bandwidth, and gp2’s throughput (3 IOPS per provisioned GiB) suffices for MR-D.

Conversely, MR-F, with the second-highest throughput needs among the six workloads, benefits from local SSD, making Mimir-LocalOnly more cost-efficient than Mimir-EBSonly. Interestingly, for MR-F, compute-optimized instance types like c5d are more economical than storage-optimized i3.xlarge. This is because MR-F demands high computing power for its high data request rate. This evaluation implies that selecting the right instance type is as important as considering various storage options.

Our paper also covers additional evaluations, including the optimization overhead and Mimir’s effectiveness for dynamic workloads. For comprehensive details, please refer to our research paper.

Conclusion

Mimir finds the cost-efficient virtual storage cluster configurations for distributed storage backends. By using provided workload information and performance requirements, Mimir predicts resource requirements and explores the complex, heterogeneous set of block storage offerings to identify the lowest cost VSC configuration that satisfies the customer’s need. Experiments show that no single allocation type is best for all workloads and that a mix of allocation types is the best choice for some workloads. Compared to a state-of-the-art approach, Mimir finds the VSC configurations that satisfy requirements at up to 81% lower cost for Meta’s RocksDB workloads.

You can find more detailed information in our published paper.
