DOI: 10.5555/3488766.3488806

Aragog: scalable runtime verification of shardable networked systems

Published: 04 November 2020

Abstract

    Network functions like firewalls, proxies, and NATs are instances of distributed systems that lie on the critical path for a substantial fraction of today's cloud applications. Unfortunately, validating these systems remains difficult due to their complex stateful, timed, and distributed behaviors.
    In this paper, we present the design and implementation of Aragog, a runtime verification system for distributed network functions that achieves high expressiveness, fidelity, and scalability. Given a property of interest, Aragog efficiently checks running systems for violations of the property with a scale-out architecture consisting of a collection of global verifiers and local monitors. To improve performance and reduce communication overhead, Aragog includes an array of optimizations that leverage properties of networked systems to suppress provably unnecessary system events and to shard verification over every available local and global component. We evaluate Aragog over several network functions including a NAT Gateway that powers Azure, identifying both design and implementation bugs in the process.
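    The scale-out idea in the abstract (local monitors that suppress events which cannot affect the verdict, and global verifiers that each own a shard of the state) can be illustrated with a minimal sketch. This is not Aragog's implementation or API: the names `Event`, `LocalMonitor`, `GlobalVerifier`, and `shard_for`, the event kinds, and the example property ("an inbound packet must not arrive on a flow with no earlier outbound packet") are all assumptions made only for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    flow: str  # hypothetical flow identifier
    kind: str  # "outbound", "inbound", or any other event kind


class GlobalVerifier:
    """Checks the illustrative property for the flows in its shard."""

    def __init__(self):
        self.opened = set()
        self.violations = []

    def receive(self, ev):
        if ev.kind == "outbound":
            self.opened.add(ev.flow)
        elif ev.kind == "inbound" and ev.flow not in self.opened:
            self.violations.append(ev)


def shard_for(flow, n):
    # crc32 is stable across processes, unlike the built-in hash() on str,
    # so every monitor routes a given flow to the same verifier.
    return zlib.crc32(flow.encode()) % n


class LocalMonitor:
    """Forwards only events that can change a verifier's state."""

    def __init__(self, verifiers):
        self.verifiers = verifiers
        self.reported_open = set()
        self.sent = 0
        self.suppressed = 0

    def observe(self, ev):
        # Event kinds the property never mentions cannot affect the verdict.
        if ev.kind not in ("outbound", "inbound"):
            self.suppressed += 1
            return
        if ev.kind == "outbound":
            # A repeated "outbound" for a flow already reported is a no-op
            # at the verifier, so it is safe to drop it here.
            if ev.flow in self.reported_open:
                self.suppressed += 1
                return
            self.reported_open.add(ev.flow)
        index = shard_for(ev.flow, len(self.verifiers))
        self.verifiers[index].receive(ev)
        self.sent += 1


verifiers = [GlobalVerifier() for _ in range(2)]
monitor = LocalMonitor(verifiers)
trace = [
    Event("a", "outbound"),
    Event("a", "outbound"),
    Event("a", "heartbeat"),
    Event("a", "inbound"),
    Event("b", "inbound"),
]
for ev in trace:
    monitor.observe(ev)
violations = [v for g in verifiers for v in g.violations]
```

    In this toy trace the monitor forwards three of the five events, and the only violation reported is the inbound packet on flow "b". Because all events of a flow hash to the same verifier, each verifier can decide its shard without talking to the others.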

    Published In

    OSDI'20: Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation
    November 2020
    1255 pages
    ISBN: 978-1-939133-19-9

    Sponsors

    • ORACLE
    • VMware
    • Google Inc.
    • Amazon
    • Microsoft

    Publisher

    USENIX Association

    United States

    Qualifiers

    • Research-article
