DOI: 10.1145/3600006.3613176
research-article
Open access

Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications

Published: 23 October 2023

Abstract

Modern internet-scale applications suffer from cross-service inconsistencies, arising because applications combine multiple independent and mutually-oblivious datastores. The end-to-end execution flow of each user request spans many different services and datastores along the way, implicitly establishing ordering dependencies among operations at different datastores. Readers should observe this ordering and, in today's systems, they do not.
In this work, we present Antipode, a bolt-on technique for preventing cross-service consistency violations in distributed applications. It enforces cross-service consistency by propagating lineages of datastore operations both alongside end-to-end requests and within datastores. Antipode enables a novel cross-service causal consistency model, which extends existing causality models, and whose enforcement requires us to bring in a series of technical contributions to address fundamental semantic, scalability, and deployment challenges. We implemented Antipode as an application-level library, which can easily be integrated into existing applications with minimal effort, is incrementally deployable, and does not require global knowledge of all datastore operations. We apply Antipode to eight open-source and public cloud datastores and two microservice benchmark applications. Our evaluation demonstrates that Antipode is able to prevent cross-service inconsistencies with limited programming effort and less than 2% impact on end-user latency and throughput.
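
The lineage idea in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not Antipode's actual library or API: the names `WriteId`, `Lineage`, `ReplicatedStore`, and `barrier_ready` are assumptions made for illustration, and replication lag is simulated by an explicit `replicate()` call. The sketch shows a lineage of writes carried along with a request, and a check that a reader could use to wait until every write in that lineage is visible locally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteId:
    """Identifies one write: which store, which key, which version (hypothetical)."""
    store: str
    key: str
    version: int


@dataclass
class Lineage:
    """Set of writes that a request causally depends on (hypothetical)."""
    deps: set = field(default_factory=set)

    def add(self, write_id):
        self.deps.add(write_id)


class ReplicatedStore:
    """Toy datastore: writes land on the primary and reach the
    replica only when replicate() is called (simulated lag)."""

    def __init__(self, name):
        self.name = name
        self.primary = {}
        self.replica = {}
        self.pending = []

    def write(self, key, value, lineage):
        # Bump the per-key version and record the write in the lineage.
        version = self.primary.get(key, (0, None))[0] + 1
        self.primary[key] = (version, value)
        self.pending.append((key, version, value))
        lineage.add(WriteId(self.name, key, version))

    def replicate(self):
        # Apply all pending writes to the replica, in order.
        for key, version, value in self.pending:
            self.replica[key] = (version, value)
        self.pending.clear()

    def visible_at_replica(self, write_id):
        version, _ = self.replica.get(write_id.key, (0, None))
        return version >= write_id.version


def barrier_ready(lineage, stores):
    """True once every write in the lineage is visible at the replicas."""
    return all(stores[w.store].visible_at_replica(w) for w in lineage.deps)


# One request writes a post, then a notification that refers to it.
posts = ReplicatedStore("posts")
notifications = ReplicatedStore("notifications")
stores = {"posts": posts, "notifications": notifications}

lineage = Lineage()
posts.write("post:1", "hello", lineage)
notifications.write("notif:1", "post:1", lineage)

# The notification replicates first, so a reader at the replica
# would see a notification whose post is still missing.
notifications.replicate()
before = barrier_ready(lineage, stores)

# Once the post has replicated too, the lineage is fully visible.
posts.replicate()
after = barrier_ready(lineage, stores)
```

In this toy, a reader that consults `barrier_ready` before following the notification would hold off until the post has arrived, which is the kind of cross-service ordering violation the abstract describes. A real deployment would have to carry the lineage across service boundaries and block or retry instead of polling a flag.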

Supplementary Material

PDF File (p298-loff-supp.pdf)
Supplemental material.

Published In

SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles
October 2023
802 pages
ISBN:9798400702297
DOI:10.1145/3600006
This work is licensed under a Creative Commons Attribution International 4.0 License.

In-Cooperation

  • USENIX

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

  • FCT: Fundação para a Ciência e a Tecnologia

Acceptance Rates

SOSP '23 Paper Acceptance Rate 43 of 232 submissions, 19%;
Overall Acceptance Rate 131 of 716 submissions, 18%