Component-distinguishable Co-location and Resource Reclamation for High-throughput Computing

Published: 13 February 2024

Abstract

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat multi-component LC services as monolithic applications and treat BEs as “second-class citizens” when allocating resources. Neglecting both the inconsistent interference tolerance of LC components and the inconsistent preemption loss of BE workloads results in missed co-location opportunities for higher throughput.
We present Rhythm, a co-location controller that deploys workloads and reclaims resources rhythmically to maximize system throughput while guaranteeing the LC service’s tail-latency requirement. The key idea is to differentiate the BE throughput launched with each LC component: components with higher interference tolerance are deployed together with more BE jobs. Rhythm also assigns different reclamation priority values to BEs by evaluating their preemption losses and placing them into a multi-level reclamation queue. We implement and evaluate Rhythm using workloads in the form of containerized processes and microservices. Experimental results show that it improves system throughput by 47.3%, CPU utilization by 38.6%, and memory bandwidth utilization by 45.4% while guaranteeing the tail-latency requirement.
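The loss-aware reclamation idea from the abstract can be illustrated with a minimal sketch. This is not Rhythm's implementation; the job names, the `preemption_loss` scores, and the CPU-count accounting are all hypothetical, standing in for whatever loss metric and resource dimensions the controller actually uses. The point is only the ordering policy: when an LC component needs resources back, BE jobs with the lowest estimated preemption loss are evicted first.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class BEJob:
    # Lower estimated preemption loss => cheaper to evict, reclaimed first.
    preemption_loss: float
    name: str = field(compare=False)
    cpus: int = field(compare=False)

class ReclamationQueue:
    """Sketch of a loss-ordered reclamation queue: BE jobs are kept in a
    min-heap keyed by estimated preemption loss, and resources are
    reclaimed from the cheapest-to-preempt jobs first."""

    def __init__(self):
        self._heap = []

    def add(self, job: BEJob):
        heapq.heappush(self._heap, job)

    def reclaim(self, cpus_needed: int):
        """Preempt BE jobs (lowest loss first) until enough CPUs are freed."""
        victims, freed = [], 0
        while self._heap and freed < cpus_needed:
            job = heapq.heappop(self._heap)
            victims.append(job)
            freed += job.cpus
        return victims, freed

# Hypothetical BE jobs with made-up loss estimates.
q = ReclamationQueue()
q.add(BEJob(preemption_loss=0.9, name="spark-etl", cpus=4))
q.add(BEJob(preemption_loss=0.1, name="scimark", cpus=2))
q.add(BEJob(preemption_loss=0.4, name="tf-train", cpus=4))

victims, freed = q.reclaim(cpus_needed=5)
print([v.name for v in victims], freed)  # evicts "scimark" then "tf-train", freeing 6 CPUs
```

A multi-level variant, as the abstract describes, would bucket jobs into discrete priority levels rather than a single heap, but the eviction order (lowest preemption loss first) is the same.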


    Published In

    ACM Transactions on Computer Systems, Volume 42, Issue 1-2
    May 2024, 144 pages
    EISSN: 1557-7333
    DOI: 10.1145/3647985
    Editors: Sam H. Noh, Robbert van Renesse

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 February 2024
    Online AM: 18 November 2023
    Accepted: 20 October 2023
    Published in TOCS Volume 42, Issue 1-2

    Author Tags

    1. Datacenters
    2. resource utilization
    3. tail latency
    4. co-locating

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Shandong Provincial Natural Science Foundation
    • National Natural Science Foundation of China
    • Zhejiang Lab

