Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

Published: 13 February 2024

Abstract

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LC services as monolithic applications and treat BEs as “second-class citizens” when allocating resources. Neglecting both the inconsistent interference tolerance of LC components and the inconsistent preemption loss of BE workloads results in missed co-location opportunities for higher throughput.
We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically to maximize system throughput while guaranteeing the LC service’s tail-latency requirement. The key idea is to differentiate the BE throughput launched with each LC component: components with higher interference tolerance are deployed together with more BE jobs. Rhythm also assigns different reclamation priority values to BEs by evaluating their preemption losses and placing them into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it improves system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail-latency requirement.
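The loss-aware reclamation idea from the abstract can be illustrated with a minimal sketch. This is not Rhythm's implementation; the job names, the `preemption_loss` scores, and the CPU-count accounting are all hypothetical, standing in for whatever loss metric and resource dimensions the controller actually uses. The point is only the ordering policy: when an LC component needs resources back, BE jobs with the lowest estimated preemption loss are evicted first.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class BEJob:
    # Lower estimated preemption loss => cheaper to evict, reclaimed first.
    preemption_loss: float
    name: str = field(compare=False)
    cpus: int = field(compare=False)

class ReclamationQueue:
    """Sketch of a loss-ordered reclamation queue: BE jobs are kept in a
    min-heap keyed by estimated preemption loss, and resources are
    reclaimed from the cheapest-to-preempt jobs first."""

    def __init__(self):
        self._heap = []

    def add(self, job: BEJob):
        heapq.heappush(self._heap, job)

    def reclaim(self, cpus_needed: int):
        """Preempt BE jobs (lowest loss first) until enough CPUs are freed."""
        victims, freed = [], 0
        while self._heap and freed < cpus_needed:
            job = heapq.heappop(self._heap)
            victims.append(job)
            freed += job.cpus
        return victims, freed

# Hypothetical BE jobs with made-up loss estimates.
q = ReclamationQueue()
q.add(BEJob(preemption_loss=0.9, name="spark-etl", cpus=4))
q.add(BEJob(preemption_loss=0.1, name="scimark", cpus=2))
q.add(BEJob(preemption_loss=0.4, name="tf-train", cpus=4))

victims, freed = q.reclaim(cpus_needed=5)
print([v.name for v in victims], freed)  # evicts "scimark" then "tf-train", freeing 6 CPUs
```

A multi-level variant, as the abstract describes, would bucket jobs into discrete priority levels rather than a single heap, but the eviction order (lowest preemption loss first) is the same.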


    Published In

    ACM Transactions on Computer Systems, Volume 42, Issue 1-2
    May 2024, 144 pages
    EISSN: 1557-7333
    DOI: 10.1145/3647985
    Editors: Sam H. Noh, Robbert van Renesse

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 February 2024
    Online AM: 18 November 2023
    Accepted: 20 October 2023
    Published in TOCS Volume 42, Issue 1-2

    Author Tags

    1. Datacenters
    2. resource utilization
    3. tail latency
    4. co-locating

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Shandong Provincial Natural Science Foundation
    • National Natural Science Foundation of China
    • Zhejiang Lab

