Draconis: Network-Accelerated Scheduling for Microsecond-Scale Workloads

Authors: Sreeharsha Udayashankar, Ashraf Abdel-Hadi, Ali Mashtizadeh, Samer Al-Kiswany

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems

Pages 333 - 348

https://doi.org/10.1145/3627703.3650060

Published: 22 April 2024

Abstract

We present Draconis, a novel scheduler for workloads in the range of tens to hundreds of microseconds. Draconis challenges the popular belief that programmable switches cannot house the complex data structures, such as queues, needed to support an in-network scheduler. Using programmable switches, Draconis achieves the low scheduling tail latency and high throughput needed to support these microsecond-scale workloads on large clusters. Furthermore, Draconis supports a wide range of complex scheduling policies, including locality-aware scheduling, priority-based scheduling, and resource-based scheduling.

Draconis reduces 99th-percentile scheduling latency by 3× to 200× compared to state-of-the-art software-based and network-accelerated schedulers on a range of synthetic workloads. Our evaluation also demonstrates that Draconis achieves 52× higher throughput than server-based scheduling systems.

Published In

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems

April 2024

1245 pages

ISBN: 9798400704376

DOI: 10.1145/3627703

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Artifacts Available / v1.1

Qualifiers

Research-article
Research
Refereed limited

Conference

EuroSys '24

Sponsor:

SIGOPS

EuroSys '24: Nineteenth European Conference on Computer Systems

April 22 - 25, 2024

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
