Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3132747.3132780acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article
Open access

ZygOS: Achieving Low Tail Latency for Microsecond-scale Networked Tasks

Published: 14 October 2017 Publication History

Abstract

This paper focuses on the efficient scheduling on multicore systems of very fine-grain networked tasks, which are the typical building block of online data-intensive applications. The explicit goal is to deliver high throughput (millions of remote procedure calls per second) for tail latency service-level objectives that are a small multiple of the task size.
We present ZYGOS, a system optimized for μs-scale, in-memory computing on multicore servers. It implements a work-conserving scheduler within a specialized operating system designed for high request rates and a large number of network connections. ZYGOS uses a combination of shared-memory data structures, multi-queue NICs, and inter-processor interrupts to rebalance work across cores.
For an aggressive service-level objective expressed at the 99th percentile, ZYGOS achieves 75% of the maximum possible load determined by a theoretical, zero-overhead model (centralized queueing with FCFS) for 10μs tasks, and 88% for 25μs tasks.
We evaluate ZYGOS with a networked version of Silo, a state-of-the-art in-memory transactional database, running TPC-C. For a service-level objective of 1000μs latency at the 99th percentile, ZYGOS can deliver a 1.63x speedup over Linux (because of its dataplane architecture) and a 1.26x speedup over IX, a state-of-the-art dataplane (because of its work-conserving scheduler).

Supplementary Material

MP4 File (zygos.mp4)

References

[1]
Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., and Paleczny, M. Workload analysis of a large-scale key-value store. In Proceedings of the 2012 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2012), pp. 53--64.
[2]
Barroso, L. A., Clidaras, J., and Hölzle, U. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2013.
[3]
Barroso, L. A., Marty, M., Patterson, D., and Ranganathan, P. Attack of the killer microseconds. Commun. ACM 60, 4 (2017), 48--54.
[4]
Belay, A., Bittau, A., Mashtizadeh, A. J., Terei, D., Mazières, D., and Kozyrakis, C. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th Symposium on Operating System Design and Implementation (OSDI) (2012), pp. 335--348.
[5]
Belay, A., Prekas, G., Primorac, M., Klimovic, A., Grossman, S., Kozyrakis, C., and Bugnion, E. The IX Operating System: Combining Low Latency, High Throughput, and Efficiency in a Protected Dataplane. ACM Trans. Comput. Syst. 34, 4 (2017), 11:1--11:39.
[6]
Bronson, N., Amsden, Z., Cabrera, G., Chakka, P., Dimov, P., Ding, H., Ferris, J., Giardullo, A., Kulkarni, S., Li, H. C., Marchukov, M., Petrov, D., Puzar, L., Song, Y. J., and Venkataramani, V. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference (ATC) (2013), pp. 49--60.
[7]
10 thousand connections problem. http//:www.kegel.com/c10k.html, 1999.
[8]
Linux cfs scheduler. https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt.
[9]
Chase, D., and Lev, Y. Dynamic circular work-stealing deque. In Proceedings of the 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2005), pp. 21--28.
[10]
Clements, A. T., Kaashoek, M. F., Zeldovich, N., Morris, R. T., and Kohler, E. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. ACM Trans. Comput. Syst. 32, 4 (2015), 10:1--10:47.
[11]
Contreras, G., and Martonosi, M. Characterizing and improving the performance of Intel Threading Building Blocks. In Proceedings of the 2008 IEEE International Symposium on Workload Characterization (IISWC) (2008), pp. 57--66.
[12]
Dean, J., and Barroso, L. A. The tail at scale. Commun. ACM 56, 2 (2013), 74--80.
[13]
Dinan, J., Larkins, D. B., Sadayappan, P., Krishnamoorthy, S., and Nieplocha, J. Scalable work stealing. In Proceedings of the 2009 ACM/IEEE Conference on Supercomputing (SC) (2009).
[14]
Data plane development kit. http//:www.dpdk.org/.
[15]
Dragojevic, A., Narayanan, D., Castro, M., and Hodson, O. FaRM: Fast Remote Memory. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 401--414.
[16]
Duda, K. J., and Cheriton, D. R. Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose schedular. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP) (1999), pp. 261--276.
[17]
Dunkels, A. Design and Implementation of the lwIP TCP/IP Stack. Swedish Institute of Computer Science 2 (2001), 77.
[18]
Eisenbud, D. E., Yi, C., Contavalli, C., Smith, C., Kononov, R., Mann-Hielscher, E., Cilingiroglu, A., Cheyney, B., Shang, W., and Hosein, J. D. Maglev: A Fast and Reliable Software Network Load Balancer. In Proceedings of the 13th Symposium on Networked Systems Design and Implementation (NSDI) (2016), pp. 523--535.
[19]
Epollexclusive kernel patch. https://lwn.net/Articles/667087/, 2015.
[20]
Färber, F., Cha, S. K., Primsch, J., Bornhövd, C., Sigg, S., and Lehner, W. SAP HANA database: data management for modern business applications. SIGMOD Record 40, 4 (2011), 45--51.
[21]
Gaud, F., Geneves, S., Lachaize, R., Lepers, B., Mottet, F., Muller, G., and Quéma, V. Efficient Workstealing for Multi-core Event-Driven Systems. In Proceedings of the 30th IEEE International Conference on Distributed Computing Systems (ICDCS) (2010), pp. 516--525.
[22]
Gordon, A., Amit, N., Har'El, N., Ben-Yehuda, M., Landau, A., Schuster, A., and Tsafrir, D. ELI: bare-metal performance for I/O virtualization. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XVII) (2012), pp. 411--422.
[23]
gRPC. http//:www.grpc.io/.
[24]
Hanford, N., Ahuja, V., Balman, M., Farrens, M. K., Ghosal, D., Pouyoul, E., and Tierney, B. Characterizing the impact of end-system affinities on the end-to-end performance of highspeed flows. In Proceedings of the Third International Workshop on Network-Aware Data Management (2013), pp. 1:1--1:10.
[25]
Haque, M. E., Eom, Y. H., He, Y., Elnikety, S., Bianchini, R., and McKinley, K. S. Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XX) (2015), pp. 161--175.
[26]
Jeong, E., Woo, S., Jamshed, M. A., Jeong, H., Ihm, S., Han, D., and Park, K. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 489--502.
[27]
Kapoor, R., Porter, G., Tewari, M., Voelker, G. M., and Vahdat, A. Chronos: predictable low latency for data center applications. In Proceedings of the 2012 ACM Symposium on Cloud Computing (SOCC) (2012), p. 9.
[28]
Kernel connection multiplexer. https://lwn.net/Articles/657999/, 2015.
[29]
Kernel connection multiplexer patch. https://lwn.net/Articles/657970/, 2015.
[30]
Kivity, A., Kamay, Y., Laor, D., Lublin, U., and Liguori, A. kvm: the Linux virtual machine monitor. In Proceedings of the Linux symposium (2007), vol. 1, pp. 225--230.
[31]
Klimovic, A., Litz, H., and Kozyrakis, C. ReFlex: Remote Flash ≈ Local Flash. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXII) (2017), pp. 345--359.
[32]
Laadan, O., Nieh, J., and Viennot, N. Structured linux kernel projects for teaching operating systems concepts. In Proceedings of the 42nd ACM Technical Symposium on Computer Science Education (SIGCSE) (2011), pp. 287--292.
[33]
Leung, K.-C., Li, V. O. K., and Yang, D. An Overview of Packet Reordering in Transmission Control Protocol (TCP): Problems, Solutions, and Challenges. IEEE Trans. Parallel Distrib. Syst. 18, 4 (2007), 522--535.
[34]
Leverich, J., and Kozyrakis, C. Reconciling high server utilization and sub-millisecond quality-of-service. In Proceedings of the 2014 EuroSys Conference (2014), pp. 4:1--4:14.
[35]
Li, J., Agrawal, K., Elnikety, S., He, Y., Lee, I.-T. A., Lu, C., and McKinley, K. S. Work stealing for interactive services to meet target latency. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) (2016), pp. 14:1--14:13.
[36]
Li, J., Sharma, N. K., Ports, D. R. K., and Gribble, S. D. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the 2014 ACM Symposium on Cloud Computing (SOCC) (2014), pp. 9:1--9:14.
[37]
libevent. http://libevent.org/.
[38]
libuv. http://libuv.org/.
[39]
Lim, H., Han, D., Andersen, D. G., and Kaminsky, M. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI) (2014), pp. 429--444.
[40]
Lu, Y., Xie, Q., Kliot, G., Geller, A., Larus, J. R., and Greenberg, A. G. Join-Idle-Queue: A novel load balancing algorithm for dynamically scalable web services. Perform. Eval. 68, 11 (2011), 1056--1071.
[41]
Apache lucene. https://lucene.apache.org/.
[42]
Marinos, I., Watson, R. N. M., and Handley, M. Network stack specialization for performance. In Proceedings of the ACM SIGCOMM 2014 Conference (2014), pp. 175--186.
[43]
Memcached. https://memcached.org/.
[44]
In-memory mongodb. https://docs.mongodb.com/manual/core/inmemory/.
[45]
Nanavati, M., Wires, J., and Warfield, A. Decibel: Isolation and Sharing in Disaggregated Rack-Scale Storage. In Proceedings of the 14th Symposium on Networked Systems Design and Implementation (NSDI) (2017), pp. 17--33.
[46]
Nginx thread pool usage. https://www.nginx.com/blog/thread-pools-boost-performance-9x/.
[47]
Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI) (2013), pp. 385--398.
[48]
Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Sparrow: distributed, low latency scheduling. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP) (2013), pp. 69--84.
[49]
Pesterev, A., Strauss, J., Zeldovich, N., and Morris, R. T. Improving network connection locality on multicore systems. In Proceedings of the 2012 EuroSys Conference (2012), pp. 337--350.
[50]
Peter, S., Li, J., Zhang, I., Ports, D. R. K., Woos, D., Krishnamurthy, A., Anderson, T. E., and Roscoe, T. Arrakis: The Operating System Is the Control Plane. ACM Trans. Comput. Syst. 33, 4 (2016), 11:1--11:30.
[51]
Pf_ring. http//:www.ntop.org/products/packet-capture/pf_ring/.
[52]
Prekas, G., Belay, A., Primorac, M., Klimovic, A., Grossman, S., Kogias, M., Gütermann, B., Kozyrakis, C., and Bugnion, E. IX Open-source version 1.0 -- Deployment and Evaluation Guide. Tech. rep., EPFL Technical Report 218568, 2016.
[53]
Prekas, G., Primorac, M., Belay, A., Kozyrakis, C., and Bugnion, E. Energy proportionality and workload consolidation for latency-critical applications. In Proceedings of the 2015 ACM Symposium on Cloud Computing (SOCC) (2015), pp. 342--355.
[54]
Redis. https://redis.io/.
[55]
Rizzo, L. netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Annual Technical Conference (ATC) (2012), pp. 101--112.
[56]
Microsoft corp. receive side scaling. http://msdn.microsoft.com/library/windows/hardware/ff556942.aspx.
[57]
Schroeder, B., Wierman, A., and Harchol-Balter, M. Open Versus Closed: A Cautionary Tale. In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI) (2006).
[58]
ScillaDB Project. Seastar -- high-performance service-application framework. https://github.com/scylladb/seastar/.
[59]
Silo: Multicore in-memory storage engine. https://github.com/stephentu/silo.
[60]
Soares, L., and Stumm, M. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Proceedings of the 9th Symposium on Operating System Design and Implementation (OSDI) (2010), pp. 33--46.
[61]
Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., and Helland, P. The End of an Architectural Era (It's Time for a Complete Rewrite). In Proceedings of the 33rd International Conference on Very Large DataBases (VLDB) (2007), pp. 1150--1160.
[62]
Thekkath, C. A., Nguyen, T. D., Moy, E., and Lazowska, E. D. Implementing Network Protocols at User Level. In Proceedings of the ACM SIGCOMM 1993 Conference (1993), pp. 64--73.
[63]
TPC-C Benchmark. http//:www.tpc.org/tpcc/, 2010.
[64]
Tu, S., Zheng, W., Kohler, E., Liskov, B., and Madden, S. Speedy transactions in multicore in-memory databases. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP) (2013), pp. 18--32.
[65]
Uhlig, R., Neiger, G., Rodgers, D., Santoni, A. L., Martins, F. C. M., Anderson, A. V., Bennett, S. M., Kägi, A., Leung, F. H., and Smith, L. Intel Virtualization Technology. IEEE Computer 38, 5 (2005), 48--56.
[66]
Voltdb. https://www.voltdb.com/.
[67]
Wei, X., Shi, J., Chen, Y., Chen, R., and Chen, H. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP) (2015), pp. 87--104.
[68]
What'sapp 2m connections. https://https://blog.whatsapp.com/196/1-million-is-so-2011.
[69]
Wierman, A., and Zwart, B. Is Tail-Optimal Scheduling Possible? Operations Research 60, 5 (2012), 1249--1257.
[70]
Yang, X., Blackburn, S. M., and McKinley, K. S. Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading. In Proceedings of the 2016 USENIX Annual Technical Conference (ATC) (2016), pp. 309--322.
[71]
Yasukata, K., Honda, M., Santry, D., and Eggert, L. StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs. In Proceedings of the 2016 USENIX Annual Technical Conference (ATC) (2016), pp. 43--56.
[72]
Zeldovich, N., Yip, A., Dabek, F., Morris, R., Mazières, D., and Kaashoek, M. F. Multiprocessor Support for Event-Driven Programs. In USENIX Annual Technical Conference (2003), pp. 239--252.
[73]
Zhang, H., Dong, M., and Chen, H. Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication. In Proceedings of the 14th USENIX Conference on File and Storage Technologie (FAST) (2016), pp. 167--180.
[74]
Zygos kernel. https://github.com/ix-project/zygos, 2017.

Cited By

View all
  • (2024)Ambience: an operating system for IoT microservicesJournal of Edge Computing10.55056/jec.786Online publication date: 5-Sep-2024
  • (2024)The Case of Unsustainable CPU AffinityACM SIGEnergy Energy Informatics Review10.1145/3698365.36983714:3(32-38)Online publication date: 1-Jul-2024
  • (2024)Making Sense of Multi-threaded Application Performance at Scale with NonSequiturProceedings of the ACM on Programming Languages10.1145/36897938:OOPSLA2(2325-2354)Online publication date: 8-Oct-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
ISBN:9781450350853
DOI:10.1145/3132747
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Microsecond-scale computing
  2. Tail latency

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOSP '17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%

Upcoming Conference

SOSP '25
ACM SIGOPS 31st Symposium on Operating Systems Principles
October 13 - 16, 2025
Seoul , Republic of Korea

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)697
  • Downloads (Last 6 weeks)108
Reflects downloads up to 09 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Ambience: an operating system for IoT microservicesJournal of Edge Computing10.55056/jec.786Online publication date: 5-Sep-2024
  • (2024)The Case of Unsustainable CPU AffinityACM SIGEnergy Energy Informatics Review10.1145/3698365.36983714:3(32-38)Online publication date: 1-Jul-2024
  • (2024)Making Sense of Multi-threaded Application Performance at Scale with NonSequiturProceedings of the ACM on Programming Languages10.1145/36897938:OOPSLA2(2325-2354)Online publication date: 8-Oct-2024
  • (2024)Can OS Specialization give new life to old carbon in the cloud?Proceedings of the 17th ACM International Systems and Storage Conference10.1145/3688351.3689158(83-90)Online publication date: 16-Sep-2024
  • (2024)SmartNIC-Enabled Live Migration for Storage-Optimized VMsProceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems10.1145/3678015.3680487(45-52)Online publication date: 4-Sep-2024
  • (2024)zQoS: Unleashing full performance capabilities of NVMe SSDs while enforcing SLOs in distributed storage systemsProceedings of the 53rd International Conference on Parallel Processing10.1145/3673038.3673156(618-628)Online publication date: 12-Aug-2024
  • (2024)CloudRIC: Open Radio Access Network (O-RAN) Virtualization with Shared Heterogeneous ComputingProceedings of the 30th Annual International Conference on Mobile Computing and Networking10.1145/3636534.3649381(558-572)Online publication date: 29-May-2024
  • (2024)Draconis: Network-Accelerated Scheduling for Microsecond-Scale WorkloadsProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650060(333-348)Online publication date: 22-Apr-2024
  • (2024)Exploring the Asynchrony of Slow Memory Filesystem with EasyIOProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3629586(624-640)Online publication date: 22-Apr-2024
  • (2024)Efficient Microsecond-scale Blind Scheduling with Tiny QuantaProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3620665.3640381(305-319)Online publication date: 27-Apr-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media