research-article

Alita: comprehensive performance isolation through bias resource management for public clouds

Authors:

Minyi Guo

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 32, Pages 1 - 13

Published: 09 November 2020

Abstract

Tenants of public cloud platforms share hardware resources on the same node, which creates the potential for performance interference (or malicious attacks). A tenant can significantly degrade the performance of its neighbors on the same node by overusing the shared memory bus, last-level cache (LLC)/memory bandwidth, and power. To eliminate such unfairness, we propose Alita, a runtime system consisting of an online interference identifier and an adaptive interference eliminator. The interference identifier monitors hardware and system-level event statistics to identify resource polluters. The eliminator improves the performance of normal applications by throttling the resource usage of polluters only. Specifically, Alita adopts bus lock sparsification, bias LLC/bandwidth isolation, and selective power throttling. Results on an experimental platform and an in-production cloud platform with 30,000 nodes demonstrate that, using system-level knowledge, Alita significantly improves the performance of co-located virtual machines in the presence of resource polluters.
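The identify-then-throttle structure described in the abstract can be illustrated with a minimal sketch. It is not Alita's implementation: the counter names, the fixed thresholds in `LIMITS`, and the helpers `identify_polluters` and `plan_throttling` are all hypothetical, chosen only to show that throttling actions are planned for polluters while normal VMs are left untouched.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """Per-VM statistics for one monitoring interval (hypothetical fields)."""
    name: str
    bus_locks_per_s: float
    llc_misses_per_s: float
    mem_bw_gbps: float
    power_watts: float


# Hypothetical per-resource limits; these are NOT values from the paper.
LIMITS = {
    "bus_locks_per_s": 1_000.0,
    "llc_misses_per_s": 5e7,
    "mem_bw_gbps": 20.0,
    "power_watts": 90.0,
}


def identify_polluters(samples):
    """Map each polluting VM name to the list of resources it overuses."""
    polluters = {}
    for sample in samples:
        overused = [
            resource
            for resource, limit in LIMITS.items()
            if getattr(sample, resource) > limit
        ]
        if overused:
            polluters[sample.name] = overused
    return polluters


def plan_throttling(samples):
    """Return (vm, resource) throttle actions for polluters only."""
    return [
        (vm, resource)
        for vm, resources in identify_polluters(samples).items()
        for resource in resources
    ]


if __name__ == "__main__":
    interval = [
        VmSample("vm-a", 10.0, 1e6, 2.0, 40.0),      # well-behaved neighbor
        VmSample("vm-b", 50_000.0, 1e6, 35.0, 60.0),  # overuses bus locks and bandwidth
    ]
    for vm, resource in plan_throttling(interval):
        print(f"throttle {vm}: {resource}")
```

A real system would presumably replace the static thresholds with an online detector and map each action to a concrete mechanism (for example cache or bandwidth partitioning); the sketch only captures the selective nature of the throttling.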

Cited By

Shahrad M, Elnikety S, Bianchini R. (2021). Provisioning Differentiated Last-Level Cache Allocations to VMs in Public Clouds. Proceedings of the ACM Symposium on Cloud Computing, pages 319-334. Online publication date: 1-Nov-2021.
https://dl.acm.org/doi/10.1145/3472883.3487006

Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2020

1454 pages

ISBN: 9781728199986

General Chair: Christine Cuicchi
Program Chairs: Irene Qualters, William Kramer

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

IEEE Press

Conference

SC '20

Sponsor:

SIGHPC

SC '20: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 9 - 19, 2020

Atlanta, Georgia

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
