5.2 MIOS Under Production Workloads
We first evaluate the effectiveness of MIOS_D under four Pangu production workloads on the WD 10 TB HDD.
Write Performance. Figure 7 shows that the average and tail (\(99^{th}\)- and \(99.9^{th}\)-percentile) latencies of all four workloads are significantly reduced by MIOS_D. Among the four workloads, B gains the most benefit: its average, \(99^{th}\)- and \(99.9^{th}\)-percentile latencies are reduced by 65%, 85%, and 95%, respectively. In contrast, these three latencies in A are reduced by only about 2%, 3.5%, and 30%, respectively, far less than in the other workloads. The reason is that redirection in MIOS_D is triggered only when the queue length is high, whereas A has the lowest intensity and thus the least queue blocking, which renders MIOS much less useful.
To better understand the root causes of these results, Figure 8 shows the cumulative distribution functions (CDFs) of SSD queue lengths for the four workloads. MIOS_D significantly shortens queue lengths compared to Baseline; B and A see the maximum (95%) and minimum (15%) reductions in their queue lengths. Therefore, the overall queueing delay is reduced significantly.
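The queue-length gate that drives these results can be summarized by a minimal dispatch sketch, shown below. It assumes a simplified view of BCW's write states (F for the fast buffered state, S for slow) and hypothetical identifiers (dispatch_write, ssd_queue_len) that are not taken from the Pangu codebase; the real scheduler and state detection are more involved.

```c
#include <stdio.h>

/* Simplified HDD buffered-write states exposed by BCW (Section 4). */
typedef enum { HDD_F, HDD_S } hdd_state_t;

/* MIOS_D dispatch rule (sketch): redirect a write to the HDD via BCW only
 * when the SSD queue is longer than the threshold L AND the HDD is in a
 * fast buffered-write state; otherwise the write goes to the SSD. */
static const char *dispatch_write(int ssd_queue_len, int L, hdd_state_t hdd_state)
{
    if (ssd_queue_len > L && hdd_state == HDD_F)
        return "HDD (BCW)";
    return "SSD";
}

int main(void)
{
    /* A low-intensity workload rarely exceeds L, so almost everything stays
     * on the SSD; an intensive workload exceeds L often and is redirected. */
    printf("qlen=2,  L=8: %s\n", dispatch_write(2, 8, HDD_F));
    printf("qlen=40, L=8: %s\n", dispatch_write(40, 8, HDD_F));
    printf("qlen=40, L=8, slow HDD: %s\n", dispatch_write(40, 8, HDD_S));
    return 0;
}
```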
Request Size. To better understand the impact of write size on MIOS and BCW, we break all redirected requests down into six groups by IO size and measure the average latency with MIOS_D in each group. Figure 9 shows that MIOS_D reduces the average latency of writes smaller than 64 KB in all four workloads, with workload B benefiting the most: the average latencies of its three small-size groups (<4 KB, 4 KB–16 KB, and 16 KB–64 KB) are reduced by 61%, 85%, and 59%, respectively. The other three workloads also see latency reductions, to varying degrees. In Baseline, small and intensive requests cause queue blocking more frequently (Figure 2) than in MIOS_D; therefore, MIOS_D is the most effective at reducing latency in such cases.
However, for requests larger than 256 KB, the average latency increases in all workloads except B. For workload D, the average latency increases by 31.7% in the >1 MB group and by 12.1% in the 256 KB–1 MB group; the average latency of the 256 KB–1 MB group in C also increases by 20.1%. The reason is twofold. First, large SSD writes under light load outperform HDD writes because the high internal parallelism of SSDs favors large requests. Second, large writes are relatively sparse and are rarely completely blocked; for example, the average latency of the >256 KB groups in Baseline is very close to the raw SSD write performance.
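For reference, the size-group breakdown behind Figure 9 can be reproduced with a simple bucketing helper, sketched below; the 64 KB–256 KB boundary is implied by the other five ranges quoted above, and the function name and example sizes are illustrative only.

```c
#include <stddef.h>
#include <stdio.h>

#define KB 1024UL
#define MB (1024UL * KB)

/* Six IO-size groups used in the Figure 9 breakdown (the 64 KB-256 KB
 * boundary is implied by the other five ranges quoted in the text). */
static size_t size_group(size_t io_bytes)
{
    if (io_bytes < 4 * KB)    return 0;  /* <4 KB        */
    if (io_bytes < 16 * KB)   return 1;  /* 4 KB-16 KB   */
    if (io_bytes < 64 * KB)   return 2;  /* 16 KB-64 KB  */
    if (io_bytes < 256 * KB)  return 3;  /* 64 KB-256 KB */
    if (io_bytes <= 1 * MB)   return 4;  /* 256 KB-1 MB  */
    return 5;                            /* >1 MB        */
}

int main(void)
{
    /* Arbitrary example sizes, used only to show the bucketing. */
    size_t examples[] = { 2 * KB, 8 * KB, 512 * KB, 2 * MB };
    for (size_t i = 0; i < sizeof(examples) / sizeof(examples[0]); i++)
        printf("%zu bytes -> group %zu\n", examples[i], size_group(examples[i]));
    return 0;
}
```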
Queue Length Threshold \(\textbf {L}\). To evaluate the effect of selecting \(L\), we compare the pre-defined \(L\) value (Def), determined by the process described in Section 4.2, with \(L+1\) (Inc). Note that the pre-defining process for the queue-length threshold is designed to trade off decreasing the write latency against reducing the write traffic to the SSD.
Figure 10(a) shows that Inc slightly reduces the average, \(99^{th}\)- and \(99.9^{th}\)-percentile latencies compared to Def; among the four workloads, the maximum reduction in average latency is less than 10%. This is because the longer the queue, the longer the waiting delay a request experiences, so Inc gains more latency benefit from each redirection than Def. However, the choice of \(L\) greatly affects the amount of redirected data. In Figure 10(b), the number of redirected requests is much smaller with Inc than with Def: the amount of redirected data for workloads A\(\sim\)D decreases by 94%, 64%, 52%, and 62%, respectively. These results are consistent with the implications of Figure 8 that a higher queue-length threshold triggers far fewer SSD overuse alerts, significantly reducing the chances of redirecting requests to the HDD.
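Operationally, Def and Inc differ only in the threshold applied to the observed queue length, so raising it can only shrink the set of redirected requests. The sketch below illustrates this with an arbitrary queue-length trace; the trace values and identifiers are illustrative, not measured data.

```c
#include <stddef.h>
#include <stdio.h>

/* Count how many requests in a trace would be redirected under a given
 * threshold: a request is redirected when the SSD queue length observed at
 * submission time exceeds the threshold. */
static size_t count_redirected(const int *qlen_at_submit, size_t n, int threshold)
{
    size_t redirected = 0;
    for (size_t i = 0; i < n; i++)
        if (qlen_at_submit[i] > threshold)
            redirected++;
    return redirected;
}

int main(void)
{
    /* Arbitrary illustrative trace of SSD queue lengths at submission. */
    int trace[] = { 1, 3, 9, 10, 9, 2, 11, 9, 9, 4 };
    size_t n = sizeof(trace) / sizeof(trace[0]);
    int L = 8;

    /* Inc (L+1) redirects a subset of what Def (L) redirects. */
    printf("Def (L=%d):   %zu redirected\n", L, count_redirected(trace, n, L));
    printf("Inc (L+1=%d): %zu redirected\n", L + 1, count_redirected(trace, n, L + 1));
    return 0;
}
```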
MIOS_D vs. MIOS_E. We compare
MIOS_D with
MIOS_E in terms of the amount of data written to SSD and HDD, and the number of redirected write requests. Results are shown in Table
5. Workload A has the highest percentage of redirected data and requests with MIOS_E, which reduces the SSD written data by up to 93.3% compared with Baseline, significantly more than MIOS_D does. Since workload A has the lowest IO intensity, MIOS_E has more chances to redirect requests even when the queue length is low. Note that we also count the padded data in BCW toward the amount of data written to the HDD, so the total amount of data written can vary a great deal. Workload B has the lowest percentage of redirection with MIOS_E, which reduces SSD written data by 30%; nevertheless, the absolute amount of redirected data is very large because the SSD written data in Baseline is larger than in any of the other three workloads. Compared with MIOS_D, MIOS_E greatly decreases the amount of data written to the SSD and is therefore more effective at alleviating SSD wear-out.
However, the downside of MIOS_E is increased average and tail latency. In Figure 11(a), MIOS_E leads to higher average latency than MIOS_D, by up to 40% under workload A, while for the other three workloads the average latency remains largely unchanged. This is because many more writes (>90%) are redirected by MIOS_E than by MIOS_D in workload A, and requests served by the HDD experience longer latency than those served by the SSD. Moreover, the
\(99.9^{th}\)-percentile latency of
MIOS_E is increased by 70% in A, 55% in B, 31% in C, and 8% in D compared to
MIOS_D. The results can be explained by Figure
11(b).
MIOS_E increases the average latency for nearly all IO-size groups, especially those with requests larger than 256 KB.
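The contrast between the two strategies can be summarized by their simplified redirection predicates, sketched below. This is a minimal reading of the description above and of Section 4.2: MIOS_D waits for the queue length to exceed \(L\), while MIOS_E also redirects at low, non-zero queue lengths whenever the HDD can absorb a fast buffered write. The identifiers and the exact MIOS_E condition shown here are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified BCW write states: F (fast, buffered) and S (slow). */
typedef enum { HDD_F, HDD_S } hdd_state_t;

/* MIOS_D (sketch): redirect only when the SSD queue exceeds L AND an HDD
 * is in a fast buffered-write state. */
static bool miosd_redirect(int ssd_qlen, int L, hdd_state_t hdd)
{
    return ssd_qlen > L && hdd == HDD_F;
}

/* MIOS_E (sketch, assumed condition): do not wait for the queue to exceed
 * L; redirect whenever the queue is non-empty and the HDD is fast. This
 * trades some latency for a much larger reduction in SSD written data. */
static bool miose_redirect(int ssd_qlen, int L, hdd_state_t hdd)
{
    (void)L;
    return ssd_qlen > 0 && hdd == HDD_F;
}

int main(void)
{
    printf("qlen=2,  L=8: MIOS_D=%d MIOS_E=%d\n",
           miosd_redirect(2, 8, HDD_F), miose_redirect(2, 8, HDD_F));
    printf("qlen=40, L=8: MIOS_D=%d MIOS_E=%d\n",
           miosd_redirect(40, 8, HDD_F), miose_redirect(40, 8, HDD_F));
    return 0;
}
```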
Comparison with Other HDD Writing Mechanisms. We compare BCW with other HDD writing mechanisms. The first is BCW_OF, which uses only the \(F\) write state by proactively issuing \(sync()\) whenever the ADW reaches \(W_{f}\) on an HDD. The second is Logging [36, 48], a common way to improve HDD-write performance by storing written data in an append-only manner. The difference between Logging and BCW is that the former cannot predict or selectively exploit the low-latency HDD buffered write states, while the latter can. We measure the average, \(99^{th}\)-, and \(99.9^{th}\)-percentile latencies and the reduction in SSD written data with BCW_OF and Logging. We combine all these HDD write mechanisms with the MIOS scheduler using the MIOS_E strategy, take MIOS_E with BCW as the baseline, and present performance normalized to it.
Figure 12(a) shows that the \(99.9^{th}\)-percentile latency of BCW_OF is 2.19\(\times\) higher than that of BCW in workload C. The \(99^{th}\)-percentile latency in the B, C, and D workloads also increases by 2.5\(\times\), 1.1\(\times\), and 1.9\(\times\), respectively. This means that BCW_OF becomes less effective at reducing tail latency as the workload grows heavier, because it redirects fewer requests when the SSD suffers queue blockage. Figure 12(b) shows that the volume of data redirected away from the SSD is reduced by 71% in workload A and 26% in workload B. As mentioned in Section 4.2, \(sync()\) is a high-cost operation (e.g., tens of milliseconds) that flushes the HDD buffer, and the HDD cannot serve any requests during this window.
Furthermore, Logging reduces SSD written data by 10% more than BCW, but at the cost of a clear increase in write latency: the \(99.9^{th}\)-percentile latency in the B, C, and D workloads is 2.0\(\times\), 1.9\(\times\), and 1.6\(\times\) higher than with BCW, reaching several milliseconds. This is because Logging cannot prevent write requests from encountering the \(S\) write state during intensive workloads.
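The BCW_OF policy discussed above can be sketched as follows, under stated assumptions: it tracks the accumulated data written (ADW) since the last flush and proactively flushes once the ADW reaches \(W_{f}\). Here fsync() stands in for the \(sync()\) call in the text, the struct and function names are hypothetical, and error handling is elided.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* BCW_OF sketch: stay in the fast buffered window by flushing at W_f,
 * before the drive would fall into a slow write state. During the flush
 * (tens of milliseconds) the HDD cannot absorb redirected writes. */
struct bcw_of {
    int      fd;     /* HDD log file descriptor           */
    uint64_t adw;    /* accumulated data written (bytes)  */
    uint64_t w_f;    /* fast-window size W_f (bytes)      */
};

static void bcw_of_write(struct bcw_of *dev, const void *buf, size_t len)
{
    if (write(dev->fd, buf, len) < 0)
        return;                      /* error handling elided */
    dev->adw += len;
    if (dev->adw >= dev->w_f) {      /* proactive flush at W_f */
        fsync(dev->fd);              /* stands in for sync()   */
        dev->adw = 0;
    }
}

int main(void)
{
    char block[4096] = { 0 };
    int fd = open("bcw_of.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    /* 64 x 4 KB appends with an illustrative W_f of 64 KB -> 4 flushes. */
    struct bcw_of dev = { .fd = fd, .adw = 0, .w_f = 16 * sizeof(block) };
    for (int i = 0; i < 64; i++)
        bcw_of_write(&dev, block, sizeof(block));
    close(fd);
    return 0;
}
```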
Comparison with Existing IO Scheduling Approaches. LBICA [1] and SWR [38] are state-of-the-art IO scheduling approaches for SSD-HDD hybrid storage. Under write-dominated workloads, LBICA redirects requests from the tail of the SSD queue to the HDDs only when the SSD queue is long, whereas SWR preferentially redirects large write requests and allows redirection even when the SSD queue length is low. We normalize the performance of LBICA and SWR to MIOS_D and MIOS_E, respectively.
Figure
13(c) illustrates that
LBICA increases the average and tail latency by up to 2.0
\(\times\) and 6.2
\(\times\) over MIOS_D in workload B.
LBICA calculates the redirection threshold \(L\) by comparing the average latency of random HDD-writes with that of the SSD-writes at the tail of the queue, which leads to a high \(L\) value. This means that LBICA can only redirect requests when the SSD suffers a very long queue, which is a rare condition. Figure 13(d) shows that the SSD data reduction of LBICA is 70%–90% lower than that of MIOS_D. Compared to LBICA, MIOS_D performs SSD-write redirection at a lower queue length because BCW can provide \(\mu s\)-level latency for HDD-writes.
Figure 13(a) shows that the average and tail latencies of SWR are 1.2\(\times\)–2.1\(\times\) higher than those of MIOS_E. There are two main reasons. First, SWR prefers to redirect requests with large IO sizes, but Figure 9 indicates that redirecting a large request degrades performance more than redirecting a small one, due to the high internal parallelism of SSDs. Second, SWR cannot take full advantage of the low-latency HDD write states, which increases write latency. The advantage of the SWR scheduling scheme is that it can efficiently perform redirection under high IO intensity: Figure 13(b) shows that the SSD write reduction of SWR is 3.4\(\times\) more than that of MIOS_E in workload B.
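To make the contrast concrete, the sketch below encodes the two redirection conditions as described in this section, heavily simplified; the thresholds, cutoffs, and identifiers are illustrative assumptions, and both schedulers involve considerably more machinery than shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* LBICA (simplified, per the description above): redirect requests from
 * the tail of the SSD queue only when the queue is long; its threshold is
 * derived from random HDD-write latency, so it ends up high and
 * redirection is rare. */
static bool lbica_redirect(size_t ssd_qlen, size_t high_threshold)
{
    return ssd_qlen > high_threshold;
}

/* SWR (simplified): prefer redirecting large writes, even when the SSD
 * queue is short. Per Figure 9, large writes are exactly the ones that
 * lose the most by leaving the SSD. */
static bool swr_redirect(size_t io_bytes, size_t large_cutoff)
{
    return io_bytes >= large_cutoff;
}

int main(void)
{
    /* Illustrative parameters only. */
    printf("LBICA, qlen=12, thr=64: %d\n", (int)lbica_redirect(12, 64));
    printf("SWR,   512 KB write:    %d\n", (int)swr_redirect(512 * 1024, 256 * 1024));
    return 0;
}
```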
Multi-HDD Scheduling. We measure the average, \(99^{th}\)- and \(99.9^{th}\)-percentile latencies of write requests with \(N\) values from 1 through 4. When \(N=1\), multi-HDD scheduling cannot be performed. Because multi-HDD scheduling is better suited to high-intensity workloads, we take MIOS_E with \(N=1\) as the baseline and present performance and data redirection normalized to it.
Figure 14(a) shows that multi-HDD scheduling provides a consistent improvement in request latency over the single-HDD case. The average latency with \(N=4\) is 30% lower than the baseline (i.e., \(N=1\)) in workload A, and 17% higher in the other three workloads. This is because multi-HDD scheduling serves most redirected requests with the best-performing \(F\) write state. Meanwhile, the normalized \(99^{th}\)-percentile latency with \(N=4\) is reduced by 50% relative to the baseline in workload B, 20% in workloads A and C, and 10% in workload D, as more requests are redirected under high IO intensity, further relieving SSD pressure. However, the \(99.9^{th}\)-percentile latency is not significantly reduced, and even increases by 3% and 23% in workloads A and C, respectively. With multi-HDD scheduling, MIOS redirects more requests to HDDs as the number of HDDs increases; in particular, more of the 0.5% largest writes are redirected, and these experience longer latency on the HDD than on the SSD, which raises the \(99.9^{th}\)-percentile latency.
Figure 14(b) demonstrates that multi-HDD scheduling further improves SSD data reduction. In workload B, which has the largest amount of written data, \(N=4\) redirects 4\(\times\) more data than \(N=1\), while \(N=2\) or \(N=3\) is enough to redirect more than 92% of requests in the other three workloads. We also measure the performance of multi-HDD scheduling with the MIOS_D strategy and find that two HDDs are typically enough for redirection in all four workloads, with additional disks bringing little improvement, because MIOS_D performs redirection only during IO bursts.
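The selection step behind these numbers can be sketched as follows, under the assumption (consistent with the description above) that the scheduler prefers an HDD currently in the fast \(F\) buffered-write state among the \(N\) candidates; the state model and identifiers are simplified and illustrative.

```c
#include <stdio.h>

/* Simplified per-HDD write states as classified by BCW. */
typedef enum { HDD_F, HDD_S } hdd_state_t;

/* Multi-HDD scheduling (sketch): among the N candidate HDDs, pick one that
 * is currently in the best-performing F (fast buffered) state; if none is,
 * report that the request should stay on the SSD. With more HDDs, the
 * chance that at least one is in the F state rises, so more requests can
 * be redirected under high IO intensity. */
static int pick_hdd(const hdd_state_t *states, int n)
{
    for (int i = 0; i < n; i++)
        if (states[i] == HDD_F)
            return i;
    return -1;  /* no HDD can absorb a fast write right now */
}

int main(void)
{
    hdd_state_t hdds[4] = { HDD_S, HDD_S, HDD_F, HDD_S };
    int choice = pick_hdd(hdds, 4);
    if (choice >= 0)
        printf("redirect to HDD %d\n", choice);
    else
        printf("keep the write on the SSD\n");
    return 0;
}
```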
Experiment with Other HDDs. We use the 4 TB WD, 4 TB Seagate, and the 10 TB WD HDD to replay workload B, comparing MIOS_D (with the default \(L\) value) with the Baseline in terms of the request latency and the amount of data written to SSD. Workload B is chosen for this experiment, since it has the most SSD written data and the most severe SSD queue blockage, clearly reflecting the effect of IO scheduling.
Figure
15 shows that different models of HDDs do not have a significant impact on the effect of
MIOS_D. First, the average and tail latencies for all three HDDs are virtually identical, with a maximum difference of less than 3%. In addition, among the six request-size groups, only the >1 MB group exhibits a large difference across HDD models: the average latency of the 10 TB WD HDD is 14% lower than that of the two 4 TB HDDs, owing to the native write performance gap between them. Table 6 shows that the HDD model also does not notably affect the amount of data redirected, with a difference of less than 5%.
Experiment with Other SSDs. Next, to further explore the effect of MIOS with different types of SSDs, we first deploy a lower-performance 660p SSD. We replay workload A, which imposes the lowest IO pressure, with the MIOS_D and MIOS_E strategies, respectively. The latency CDF in Figure 16(a) shows that, with the lower-performing SSD, more than 7% of the requests are severely affected by long queueing delays, and the maximum queue length reaches 2,700, far exceeding the result with the better-performing 960EVO (e.g., 23, as shown in Figure 8(b)). This is because, when the IO intensity exceeds what the 660p SSD can accommodate, the SSD queue length builds up quickly. As a result, as shown in Figure 16(b), the average and tail latencies in Baseline rise sharply compared with the 960EVO SSD in Figure 7: the average latency in Baseline is 90 ms and the \(99^{th}\)-percentile latency exceeds 5 seconds.
With such high pressure on the 660p, MIOS helps reduce the IO burden on the SSD by redirecting queued requests to HDDs. As seen in Figure 16, MIOS_D alleviates queue blockage, reducing the maximum queue length by 45%. At the same time, the average latency with MIOS_E returns to the \(\mu\)s level (e.g., 521 \(\mu\)s), and the \(99^{th}\)- and \(99.9^{th}\)-percentile latencies drop to an acceptable range of 2.4 ms and 87 ms, respectively. Because MIOS_E redirects many more SSD requests even at low queue lengths, it prevents queue blockage in the SSD, particularly for a lower-performing one. Comparing this experiment on a lower-performing SSD with the earlier one on a higher-performing SSD, we believe that when the SSD in a hybrid storage system cannot sustain the intensity of a write-dominated workload, MIOS and BCW provide an effective way to improve the overall IO capacity by offloading part of the workload from the SSD to HDDs.
To further explore the effect of MIOS with the Intel Optane SSD, we replay workload B, which imposes the highest IO pressure, and employ the MIOS_D strategy. We do not use MIOS_E here because it is an aggressive strategy that redirects more requests to HDDs than MIOS_D, and the large latency gap between HDDs and Optane SSDs would result in unacceptable performance degradation. Figure 17 shows that MIOS_D remains effective in the hybrid storage system when using the Optane SSD as the primary write buffer: the \(99.9^{th}\)-percentile latency is reduced by up to 19%, and MIOS_D also alleviates queue blockage, reducing the maximum queue length by 60% in workload B. Although the Optane SSD has less GC overhead than a NAND-based SSD, the two have similar ability to handle large IOs (e.g., the Optane SSD takes 230 \(\mu s\) and the 960EVO 260 \(\mu s\), on average, for a 512 KB write). Therefore, an intensive write workload can also block queues and increase tail latency on the Optane SSD.
In addition, we compare BCW with a system that simply adds an extra SSD and distributes the workload equally across the two SSDs. Such a system achieves the same or even better latency than MIOS_E, but at a significantly higher hardware cost.
Read Performance. We measure the average, \(99^{th}\)- and \(99.9^{th}\)-percentile latencies of external read requests (i.e., user reads) with MIOS_D, MIOS_E, SWR, and LBICA in all four workloads. We present all approaches normalized to the Baseline, where all user data are written directly to the SSDs and then dumped to HDDs.
The read performance of MIOS decreases slightly compared with that of Baseline. Figure 18 indicates that the average read latency with MIOS_E increases by 27%, 23%, 14%, and 28% in workloads A, B, C, and D, respectively. The reasons are twofold. First, compared to Baseline, MIOS diverts more SSD-writes directly to HDDs in advance, meaning that the HDDs have to serve more read operations. However, the majority of read requests in the original Pangu workloads are already HDD-reads; for example, SSD-reads account for less than 1.5% of all requests in every workload, because Pangu periodically flushes data from SSD to HDD and most of the data already reside on HDDs. Therefore, MIOS_E converts up to 31% of SSD-reads into HDD-reads and affects the average read latency only slightly. Second, BCW writes in an append-only manner to improve write performance and actively pads non-user data to the HDDs, which scatters file data across the HDD. However, we find that the overhead of BCW on HDD-reads is relatively small, because most read requests in these Pangu workloads are spatially and temporally discrete, i.e., random HDD-reads. In addition, the average read request size is similar to that of writes, e.g., 4–16 KB, so BCW rarely splits a read request into multiple read IOs. Note that, as described in Section 4.3, MIOS periodically writes logged data back to their associated chunk files, so the long-term read performance of MIOS is identical to that of Baseline. Moreover, the increases in the \(99^{th}\)- and \(99.9^{th}\)-percentile latencies are less pronounced with MIOS_D and MIOS_E, because a large part of the tail read latency is caused by queue blocking, which can reach hundreds of milliseconds. We also observe that read latency increases more with MIOS than with SWR, since MIOS leads to more HDD-reads and has worse data locality on the HDD. LBICA has little effect on either average or tail read latency because it redirects far fewer SSD write requests than the other approaches.
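As a rough illustration of why the long-term read path matches Baseline, the sketch below distinguishes the three places a read may be served from under MIOS and BCW, following the description above and in Section 4.3; the enum and function are hypothetical and do not represent the actual Pangu read path.

```c
#include <stdio.h>

/* Where a chunk's latest data currently lives (simplified). */
typedef enum { LOC_SSD, LOC_HDD_LOG, LOC_HDD_CHUNK } data_loc_t;

/* Read path (sketch): serve from the SSD if the data has not been dumped
 * yet, from the append-only HDD log if it was redirected by MIOS and not
 * yet written back, and otherwise from its chunk file on the HDD. The
 * periodic write-back (Section 4.3) moves logged data into chunk files,
 * so long-term read locality matches Baseline. */
static const char *serve_read(data_loc_t loc)
{
    switch (loc) {
    case LOC_SSD:     return "read from SSD";
    case LOC_HDD_LOG: return "read from HDD log (redirected, not yet written back)";
    default:          return "read from HDD chunk file";
    }
}

int main(void)
{
    printf("%s\n", serve_read(LOC_SSD));
    printf("%s\n", serve_read(LOC_HDD_LOG));
    printf("%s\n", serve_read(LOC_HDD_CHUNK));
    return 0;
}
```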
We now discuss the tradeoff between the improved write performance and the degraded read performance caused by MIOS and BCW. In short, MIOS_D reduces the average and tail latencies of write requests, which account for 84%–98% of all requests, by 65% and 95%, respectively, while those of read requests, which account for 2%–16% of all requests, rise by 7% and up to 10%, respectively. For the MIOS_E strategy, the average and tail latencies of write requests are reduced by up to 60% and 85%, and those of read requests increase by 28% and 18%, respectively. Therefore, the MIOS approach is most beneficial for applications with write-dominated IO patterns, where a modest increase in read latency is acceptable. Users can choose the appropriate scheduling policy according to their current IO patterns: MIOS_E is more suitable for bursty and write-intensive environments, MIOS_D can be used for write-dominated workloads, and MIOS can be disabled in read-dominated situations.