research-article
Open access

Arrakis: The Operating System Is the Control Plane

Published: 02 November 2015

Abstract

Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2 to 5× in latency and 9× in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.

Published In

ACM Transactions on Computer Systems, Volume 33, Issue 4
January 2016
125 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/2841315
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 November 2015
Accepted: 01 August 2015
Received: 01 July 2015
Published in TOCS Volume 33, Issue 4

Author Tags

  1. I/O virtualization
  2. Kernel bypass
  3. SR-IOV

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NetApp, Google, and the National Science Foundation

Article Metrics

  • Downloads (Last 12 months): 725
  • Downloads (Last 6 weeks): 122

Reflects downloads up to 01 Jan 2025

Cited By

  • (2024) Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware. Proceedings of the VLDB Endowment 17:8 (2115-2122). DOI: 10.14778/3659437.3659462. Online publication date: 1-Apr-2024
  • (2024) The Case of Unsustainable CPU Affinity. ACM SIGEnergy Energy Informatics Review 4:3 (32-38). DOI: 10.1145/3698365.3698371. Online publication date: 1-Jul-2024
  • (2024) TailClipper: Reducing Tail Response Time of Distributed Services Through System-Wide Scheduling. Proceedings of the 2024 ACM Symposium on Cloud Computing (398-414). DOI: 10.1145/3698038.3698554. Online publication date: 20-Nov-2024
  • (2024) Can OS Specialization give new life to old carbon in the cloud? Proceedings of the 17th ACM International Systems and Storage Conference (83-90). DOI: 10.1145/3688351.3689158. Online publication date: 16-Sep-2024
  • (2024) Serverless End Game: Disaggregation enabling Transparency. Proceedings of the 2nd Workshop on SErverless Systems, Applications and MEthodologies (9-14). DOI: 10.1145/3642977.3652094. Online publication date: 22-Apr-2024
  • (2024) State Disaggregation for Dynamic Scaling of Network Functions. IEEE/ACM Transactions on Networking 32:1 (81-95). DOI: 10.1109/TNET.2023.3282562. Online publication date: Feb-2024
  • (2024) A Modular Distributed Operating System Architecture for Scalable Mesh Topologies. 2024 International Conference on Emerging Trends in Smart Technologies (ICETST) (1-5). DOI: 10.1109/ICETST62952.2024.10737937. Online publication date: 10-Oct-2024
  • (2024) LibPreemptible: Enabling Fast, Adaptive, and Hardware-Assisted User-Space Scheduling. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (922-936). DOI: 10.1109/HPCA57654.2024.00075. Online publication date: 2-Mar-2024
  • (2024) Empowering Cloud Computing With Network Acceleration: A Survey. IEEE Communications Surveys & Tutorials 26:4 (2729-2768). DOI: 10.1109/COMST.2024.3377531. Online publication date: Dec-2024
  • (2023) Highly Concurrent TCP Session Connection Management System on FPGA Chip. Micromachines 14:2 (385). DOI: 10.3390/mi14020385. Online publication date: 3-Feb-2023
