DOI: 10.5555/3277355.3277406

PerfIso: performance isolation for commercial latency-sensitive services

Published: 11 July 2018

Abstract

Large commercial latency-sensitive services, such as web search, run on dedicated clusters provisioned for peak load to ensure responsiveness and tolerate data center outages. As a result, the average load is far lower than the peak load used for provisioning, leading to resource under-utilization. The idle resources can be used to run batch jobs, completing useful work and reducing overall data center provisioning costs. However, this is challenging in practice due to the complexity and stringent tail-latency requirements of latency-sensitive services. Left unmanaged, the competition for machine resources can lead to severe response-time degradation and unmet service-level objectives (SLOs).
This work describes PerfIso, a performance isolation framework that has been used for nearly three years in Microsoft Bing, a major search engine, to colocate batch jobs with production latency-sensitive services on over 90,000 servers. We discuss the design and implementation of PerfIso and conduct an experimental evaluation in a production environment. We show that colocating CPU-intensive jobs with latency-sensitive services increases average CPU utilization from 21% to 66% for off-peak load without impacting tail latency.


Published In

USENIX ATC '18: Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference
July 2018
1019 pages
ISBN:9781931971447

Sponsors

  • VMware
  • NetApp
  • NSF
  • Facebook
  • ORACLE

Publisher

USENIX Association

United States