DOI: 10.1109/MICRO56248.2022.00039
research-article

Altocumulus: Scalable Scheduling for Nanosecond-Scale Remote Procedure Calls

Published: 18 December 2023

Abstract

Online services in modern datacenters use Remote Procedure Calls (RPCs) to communicate between different software layers. Although RPCs use just a few small functions, inefficient RPC handling can cause delays to propagate across the system and degrade end-to-end performance. Prior work has reduced RPC processing time to less than 1 μs, which shifts the bottleneck to RPC scheduling. Existing RPC schedulers suffer from high overheads, cannot effectively utilize high core-count CPUs, or do not adapt to different traffic patterns. To address these shortcomings, we present Altocumulus, a scalable software-hardware co-design that schedules RPCs at nanosecond scales. Altocumulus provides a proactive scheduling scheme and a low-overhead messaging mechanism on top of a decentralized user runtime. Altocumulus also offers direct access from user space to a set of simple hardware primitives to quickly migrate long-latency RPCs. We evaluate Altocumulus with synthetic workloads and an end-to-end in-memory key-value store application under real-world traffic patterns. On 16-core systems, Altocumulus improves throughput by 1.3× to 24.6× under a 99th percentile latency below 300 μs and reduces tail latency by up to 15.8× over current state-of-the-art software and hardware schedulers. For 256-core systems, integrating Altocumulus with either a hardware-optimized NIC or a commodity PCIe NIC can improve throughput by 2.8× or 2.7×, respectively, under a 99th percentile latency below 8.5 μs.


Cited By

  • (2024) SmartNIC-Enabled Live Migration for Storage-Optimized VMs. Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 45-52. DOI: 10.1145/3678015.3680487. Online publication date: 4-Sep-2024
  • (2023) NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs. Proceedings of the ACM SIGCOMM 2023 Conference, pp. 195-207. DOI: 10.1145/3603269.3604820. Online publication date: 10-Sep-2023

        Published In

        MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture
        October 2022
        1498 pages
        ISBN:9781665462723

        Publisher

        IEEE Press

        Author Tags

        1. remote procedure calls
        2. scheduling
        3. datacenters
        4. networked systems
        5. load balancing
        6. migration
        7. queuing theory

        Qualifiers

        • Research-article

        Conference

        MICRO '22

        Acceptance Rates

        Overall Acceptance Rate 484 of 2,242 submissions, 22%

        Article Metrics

        • Downloads (Last 12 months): 16
        • Downloads (Last 6 weeks): 5
        Reflects downloads up to 30 Aug 2024
