DOI: 10.1145/3357223.3362736
research-article
Open access

Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering

Published: 20 November 2019

Abstract

Distributed tracing is a core component of cloud and datacenter systems, and provides visibility into their end-to-end runtime behavior. To reduce computational and storage overheads, most tracing frameworks do not keep all traces, but sample them uniformly at random. While effective at reducing overheads, uniform random sampling inevitably captures redundant, common-case execution traces, which are less useful for analysis and troubleshooting tasks. In this work we present Sifter, a general-purpose framework for biased trace sampling. Sifter captures qualitatively more diverse traces by weighting sampling decisions towards edge-case code paths, infrequent request types, and anomalous events. Sifter does so by using the incoming stream of traces to build an unbiased low-dimensional model that approximates the system's common-case behavior. Sifter then biases sampling decisions towards traces that are poorly captured by this model. We have implemented Sifter, integrated it with several open-source tracing systems, and evaluated it with traces from a range of open-source and production distributed systems. Our evaluation shows that Sifter effectively biases towards anomalous and outlier executions, is robust to noisy and heterogeneous traces, is efficient and scalable, and adapts to changes in workloads over time.
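
The idea of biasing sampling towards traces that a common-case model captures poorly can be illustrated with a minimal sketch. This is not Sifter's implementation: the smoothed label-frequency model used here is a deliberately simple stand-in for the low-dimensional model the abstract mentions, and the class name `LossBiasedSampler`, the `rate` and `window` parameters, the event labels, and the rule that scales the keep probability by the ratio of a trace's loss to the recent mean loss are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, deque


class LossBiasedSampler:
    """Toy loss-biased sampler; a stand-in, not Sifter's actual model."""

    def __init__(self, rate=0.1, window=50, seed=0):
        self.rate = rate                    # hypothetical target sampling rate
        self.counts = Counter()             # event label -> occurrences seen so far
        self.total = 0                      # total number of events seen so far
        self.recent = deque(maxlen=window)  # losses of the most recent traces
        self.rng = random.Random(seed)

    def loss(self, trace):
        """Mean negative log-likelihood of the trace's labels under the model."""
        if not trace:
            return 0.0
        vocab = len(self.counts) + 1        # one extra slot for unseen labels
        nll = 0.0
        for label in trace:
            p = (self.counts[label] + 1) / (self.total + vocab)  # add-one smoothing
            nll -= math.log(p)
        return nll / len(trace)

    def keep_probability(self, loss):
        """Scale the target rate by how the loss compares to the recent mean."""
        mean = sum(self.recent) / len(self.recent) if self.recent else 0.0
        if mean == 0.0:
            return self.rate                # no usable history yet: sample uniformly
        return min(1.0, self.rate * loss / mean)

    def observe(self, trace):
        """Score a trace, decide whether to keep it, then update the model."""
        loss = self.loss(trace)
        self.recent.append(loss)
        p = self.keep_probability(loss)
        self.counts.update(trace)           # the model learns from every trace
        self.total += len(trace)
        return self.rng.random() < p, p


# Hypothetical workload: one dominant request type, then a rare failure path.
sampler = LossBiasedSampler(rate=0.1)
common = ["rpc_start", "db_read", "rpc_end"]
rare = ["rpc_start", "db_timeout", "retry", "rpc_end"]
for _ in range(200):
    sampler.observe(common)
_, p_common = sampler.observe(common)
_, p_rare = sampler.observe(rare)
print(f"keep probability: common={p_common:.3f}, rare={p_rare:.3f}")
```

In this sketch the model is updated with every trace, whether or not the trace is kept, so the notion of "common case" follows the workload as it changes. A trace whose loss matches the recent average is kept at roughly the target rate, while the rare failure path receives a noticeably higher keep probability.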


Published In

SoCC '19: Proceedings of the ACM Symposium on Cloud Computing
November 2019
503 pages
ISBN:9781450369732
DOI:10.1145/3357223
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '19: ACM Symposium on Cloud Computing
November 20 - 23, 2019
Santa Cruz, CA, USA

Acceptance Rates

SoCC '19 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 369
  • Downloads (last 6 weeks): 41
Reflects downloads up to 25 Dec 2024.

Cited By

  • (2024) The Diagnosis-Effective Sampling of Application Traces. Applied Sciences 14:13 (5779). DOI: 10.3390/app14135779. Online publication date: 2-Jul-2024
  • (2024) TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State. Proceedings of the ACM on Software Engineering 1:FSE (473-493). DOI: 10.1145/3643748. Online publication date: 12-Jul-2024
  • (2024) Unleashing Performance Insights with Online Probabilistic Tracing. 2024 IEEE International Conference on Cloud Engineering (IC2E) (72-82). DOI: 10.1109/IC2E61754.2024.00015. Online publication date: 24-Sep-2024
  • (2024) SampleHST-X: A Point and Collective Anomaly-Aware Trace Sampling Pipeline with Approximate Half Space Trees. Journal of Network and Systems Management 32:3. DOI: 10.1007/s10922-024-09818-8. Online publication date: 16-Apr-2024
  • (2023) LatenSeer. Proceedings of the 2023 ACM Symposium on Cloud Computing (502-519). DOI: 10.1145/3620678.3624787. Online publication date: 30-Oct-2023
  • (2023) STEAM: Observability-Preserving Trace Sampling. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1750-1761). DOI: 10.1145/3611643.3613881. Online publication date: 30-Nov-2023
  • (2023) Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. Proceedings of the ACM SIGCOMM 2023 Conference (420-437). DOI: 10.1145/3603269.3604823. Online publication date: 10-Sep-2023
  • (2023) SampleHST: Efficient On-the-Fly Selection of Distributed Traces. NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium (1-9). DOI: 10.1109/NOMS56928.2023.10154383. Online publication date: 8-May-2023
  • (2023) Search-Based Performance Testing and Analysis for Microservice-Based Digital Power Applications. 2023 6th International Conference on Energy, Electrical and Power Engineering (CEEPE) (1522-1527). DOI: 10.1109/CEEPE58418.2023.10165808. Online publication date: 12-May-2023
  • (2023) Open tracing tools: Overview and critical comparison. Journal of Systems and Software 204 (111793). DOI: 10.1016/j.jss.2023.111793. Online publication date: Oct-2023