Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Netflix
Instance Performance Analysis
Requirements
Brendan Gregg
Senior Performance Architect
Performance Engineering Team
bgregg@netflix.com @brendangregg
Jun	
  2015	
  
Monitoring companies are selling
faster horses
I want to buy a car
Server/Instance Analysis Potential
In the last 10 years…
•  More Linux
•  More Linux metrics
•  Better visualizations
•  Containers
Conditions ripe for innovation: where is our Henry Ford?
This Talk
•  Instance analysis: system resources, kernel, processes
–  For customers: what you can ask for
–  For vendors: our desirables & requirements
–  What we are building (and open sourcing) at Netflix to
modernize instance performance analysis (Vector, …)

Recommended for you

GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale

1. GPU support in Spark can accelerate analytics workloads through automatically generating CUDA code from Spark Java code or integrating Spark with GPU-enabled libraries and applications. 2. Production deployments face challenges in identifying GPU vs CPU execution, data preparation for GPU, and low resource utilization. Scheduling must handle mixed GPU and CPU workloads across non-identical hosts to avoid overload and improve utilization. 3. IBM Conductor with Spark provides solutions through fine-grained scheduling that recognizes GPU tasks, prioritizes and allocates resources independently, and allows adaptive scheduling between CPU and GPU. This improves time to results through better resource utilization.

#apachespark #sparksummit
Apache Spark
Apache SparkApache Spark
Apache Spark

Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.

big dataapache sparkreal-time processing
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...

Spark is by its nature very fault tolerant. However, faults, and application failures, can and do happen, in production at scale. In this talk, we’ll discuss the nuts and bolts of fault tolerance in Spark. We will begin with a brief overview of the sorts of fault tolerance offered, and lead into a deep dive of the internals of fault tolerance. This will include a discussion of Spark on YARN, scheduling, and resource allocation. We will then spend some time on a case study and discussing some tools used to find and verify fault tolerance issues. Our case study comes from a customer who experienced an application outage that was root caused to a scheduler bug. We discuss the analysis we did to reach this conclusion and the work that we did to reproduce it locally. We highlight some of the techniques used to simulate faults and find bugs. At the end, we’ll discuss some future directions for fault tolerance improvements in Spark, such as scheduler and checkpointing changes.

spark summit eastapache spark
•  Over 60M subscribers
•  FreeBSD CDN for content delivery
•  Massive AWS EC2 Linux cloud
•  Many monitoring/analysis tools
•  Awesome place to work
Agenda
1.  Desirables
2.  Undesirables
3.  Requirements
4.  Methodologies
5.  Our Tools
1. Desirables
Line Graphs

Recommended for you

Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service

The talk I gave a while back on the work we did at Yahoo to make Apache Storm a secure multi-tenant hosted service.

GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python

GPU Computing With Apache Spark And Python - Python is a popular language for data science and analytics due to its large ecosystem of libraries and ease of use, but it is slow for number crunching tasks. GPU computing is a way to accelerate Python workloads. - This presentation demonstrates using GPUs with Apache Spark and Python through libraries like Accelerate, which provides drop-in GPU-accelerated functions, and Numba, which can compile Python functions to run on GPUs. - As an example, the task of image registration, which involves computationally expensive 2D FFTs, is accelerated using these GPU libraries within a PySpark job, achieving a 2-4x speedup over CPU-only versions

#apachespark#sparksummit
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup

This document discusses Typesafe's Reactive Platform and Apache Spark. It describes Typesafe's Fast Data strategy of using a microservices architecture with Spark, Kafka, HDFS and databases. It outlines contributions Typesafe has made to Spark, including backpressure support, dynamic resource allocation in Mesos, and integration tests. The document also discusses Typesafe's customer support and roadmap, including plans to introduce Kerberos security and evaluate Tachyon.

Historical Data
Summary Statistics
Histograms
…	
  or	
  a	
  density	
  plot	
  
Heat Maps

Recommended for you

Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...

Learn how to make your services more resilient and available by embracing principles of isolation and redundancy. See details of 2 projects - Isthmus and Active/Active to learn how Netflix architects for availability in multi-regional environment.

active-activenetflixossisthmus
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...

By Rajiv Kurian, software engineer at SignalFx. At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include: * Write very simple single threaded code, instead of complex algorithms * Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms * Separate the data plane from the control plane, instead of slowing data for control * Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation

scalearchitecturehpc
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.

http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/ http://blog.ashansa.org/2016/02/stream-processing-is-becoming-crucial.html Batch Processing. https://github.com/karamel-lab/batch-processing-comparison Stream Processing. https://github.com/karamel-lab/stream-processing-comparison

sparkcloudbatch.streaming
Monitorama 2015 Netflix Instance Analysis
Frequency Trails
Waterfall Charts
Directed Graphs

Recommended for you

Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...

At CERN, the biggest physics laboratory in the world, large volumes of data are generated every hour, it implies serious challenges to store and process all this data. An important part of this responsibility comes to the database group which not only provides services for RDBMS but also scalable systems as Hadoop, Spark and HBase. Since databases are critical, they need to be monitored, for that we have built a highly scalable, secure and central repository that stores consolidated audit data and listener, alert and OS log events generated by the databases. This central platform is used for reporting, alerting and security policy management. The database group want to further exploit the information available in this central repository to build intrusion detection system to enhance the security of the database infrastructure. In addition, build pattern detection models to flush out anomalies using the monitoring and performance metrics available in the central repository. Finally, this platform also helps us for capacity planning of the database deployment. The audience would get first-hand experience of how to build real time Apache Spark application that is deployed in production. They would hear the challenges faced and decisions taken while developing the application and troubleshooting Apache Spark and Spark streaming application in production.

apache sparkspark summit
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm

The document discusses different technologies for real-time data collection and analysis, including Kafka for collecting and distributing streaming data, Storm for distributed real-time computation, and using PHP and FastCGI to parse real-time logs from Kafka in a Storm topology. It provides an overview of these technologies and their features, and proposes an architecture to collect logs with Kafka, process them with Storm and PHP, and output results through FastCGI.

apache stormdata analysisphp
Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)

By Andy Wingo. SDN and Network Programmability Meetup in Barcelona (VI) 21 June 2017 https://www.meetup.com/es-ES/SDN-and-Network-Programmability-Meetup-in-Barcelona /events/239667457/?eventId=239667457

meetupsdnsnabb
Flame Graphs
Flame Charts
Full System Coverage
… Without Running All These

Recommended for you

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)

Scaling Spark workloads on YARN and Mesos can provide significant performance improvements but the benefits vary across different workloads. Adding resources alone may not fully utilize the new nodes due to delay in scheduling tasks locally on the new nodes. Tuning the locality wait time parameter in Spark to quickly change task placement preference can help make better use of new resources. Dynamic executor allocation in Spark can also be enhanced to dynamically adjust configuration settings like locality wait time during auto-scaling.

spark summit 2015apache spark
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark

This document summarizes a presentation on Inferno, a system for scalable deep learning on Apache Spark. Inferno allows deep learning models built with Blaze, La Trobe University's deep learning system, to be trained faster using a Spark cluster. It coordinates distributed training of Blaze models across worker nodes, with optimized communication of weights and hyperparameters. Evaluation shows Inferno can train ResNet models on ImageNet up to 4-5 times faster than a single GPU. The presentation provides an overview of deep learning and Spark, demonstrates how Blaze allows easy model building, and explains Inferno's architecture for distributed deep learning training on Spark.

hadoop summiths16melb
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015

These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others. After a brief introduction to Kafka this talk will provide an update on the growth and status of the Kafka project community. Rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.

kafkacopycatconfluent
Deep System Coverage
Other Desirables
•  Safe for production use
•  Easy to use: self service
•  [Near] Real Time
•  Ad hoc / custom instrumentation
•  Complete documentation
•  Graph labels and units
•  Open source
•  Community
2. Undesirables
Tachometers
…especially with arbitrary color highlighting

Recommended for you

April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersApril 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separated cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. The separated clusters require large datasets to be transferred between clusters, and introduce unwanted system complexity and latency for end-to-end learning. Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to non-deep learning libraries MLlib and Spark SQL, and its data-frame style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottleneck. Recently, we have released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API and deployment on a private cloud or public cloud (AWS EC2). A demo of IPython notebook will also be given to demonstrate how CaffeOnSpark will work with other Spark packages (ex. MLlib). Speakers: Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure. Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel. Mridul Jain is Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, realtime processing for central monitoring platform, and is the co-author of Pig on Storm.

technologyhdfsflickr
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali

Luca Canali presented on using flame graphs to investigate performance improvements in Spark 2.0 over Spark 1.6 for a CPU-intensive workload. Flame graphs of the Spark 1.6 and 2.0 executions showed Spark 2.0 spending less time in core Spark functions and more time in whole stage code generation functions, indicating improved optimizations. Additional tools like Linux perf confirmed Spark 2.0 utilized CPU and memory throughput better. The presentation demonstrated how flame graphs and other profiling tools can help pinpoint performance bottlenecks and understand the impact of changes like Spark 2.0's code generation optimizations.

apache spark
ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016

Video: https://youtu.be/eO94l0aGLCA?t=3m37s . Talk by Brendan Gregg for ACM Applicative 2016 "System Methodology - Holistic Performance Analysis on Modern Systems Traditional systems performance engineering makes do with vendor-supplied metrics, often involving interpretation and inference, and with numerous blind spots. Much in the field of systems performance is still living in the past: documentation, procedures, and analysis GUIs built upon the same old metrics. For modern systems, we can choose the metrics, and can choose ones we need to support new holistic performance analysis methodologies. These methodologies provide faster, more accurate, and more complete analysis, and can provide a starting point for unfamiliar systems. Methodologies are especially helpful for modern applications and their workloads, which can pose extremely complex problems with no obvious starting point. There are also continuous deployment environments such as the Netflix cloud, where these problems must be solved in shorter time frames. Fortunately, with advances in system observability and tracers, we have virtually endless custom metrics to aid performance analysis. The problem becomes which metrics to use, and how to navigate them quickly to locate the root cause of problems. System methodologies provide a starting point for analysis, as well as guidance for quickly moving through the metrics to root cause. They also pose questions that the existing metrics may not yet answer, which may be critical in solving the toughest problems. System methodologies include the USE method, workload characterization, drill-down analysis, off-CPU analysis, and more. This talk will discuss various system performance issues, and the methodologies, tools, and processes used to solve them. The focus is on single systems (any operating system), including single cloud instances, and quickly locating performance issues or exonerating the system. Many methodologies will be discussed, along with recommendations for their implementation, which may be as documented checklists of tools, or custom dashboards of supporting metrics. In general, you will learn to think differently about your systems, and how to ask better questions."

methodologiesperformance
Pie Charts
…for real-time metrics
usr	
   sys	
   wait	
   idle	
  
Doughnuts
usr	
   sys	
   wait	
   idle	
  
…like pie charts but worse
Traffic Lights
…when used for subjective metrics
These can be used for objective metrics
For subjective metrics (eg, IOPS/latency) try weather icons instead
RED == BAD (usually)
GREEN == GOOD (hopefully)
3. Requirements

Recommended for you

HPC Application Profiling & Analysis
HPC Application Profiling & AnalysisHPC Application Profiling & Analysis
HPC Application Profiling & Analysis

This document discusses application profiling and analysis techniques. It provides an overview of profiling, which records summary information during program execution to expose performance bottlenecks. Profiling can be done through sampling or instrumentation. It also discusses tracing, which records time-stamped events during execution. The document compares profiling and tracing and describes tools like PAPI and HPCToolkit that can be used for profiling applications.

papiprogram tracesapplication programming interface
HPC Application Profiling and Analysis
HPC Application Profiling and AnalysisHPC Application Profiling and Analysis
HPC Application Profiling and Analysis

This document discusses application profiling and analysis. Profiling involves recording summary information during program execution to reflect performance behavior. It can expose bottlenecks and hotspots with low overhead. Profiling is implemented via sampling, which uses periodic interrupts, or instrumentation, which directly inserts measurement code. Tracing records significant execution points to reconstruct program behavior. Profiling provides summary statistics while tracing generates a large volume of event data. Tools like HPCToolkit use sampling and instrumentation to collect metrics that are correlated back to source code to analyze performance.

papiapplication programming interfacescalasca
Introduction to Malware Analysis
Introduction to Malware AnalysisIntroduction to Malware Analysis
Introduction to Malware Analysis

This document provides an overview of malware analysis, including both static and dynamic analysis techniques. Static analysis involves examining a file's code and components without executing it, such as identifying file types, checking hashes, and viewing strings. Dynamic analysis involves executing the malware in a controlled environment and monitoring its behavior and any system changes. Dynamic analysis tools discussed include Process Explorer, Process Monitor, and Autoruns to track malware processes, files accessed, and persistence mechanisms. Both static and dynamic analysis are needed to fully understand malware behavior.

malware infosec security
Acceptable T&Cs
•  Probably acceptable:
•  Probably not acceptable:
•  Check with your legal team
By	
  submi9ng	
  any	
  Ideas,	
  Customer	
  and	
  Authorized	
  Users	
  agree	
  
that:	
  ...	
  (iii)	
  all	
  right,	
  Ftle	
  and	
  interest	
  in	
  and	
  to	
  the	
  Ideas,	
  including	
  all	
  
associated	
  IP	
  Rights,	
  shall	
  be,	
  and	
  hereby	
  are,	
  assigned	
  to	
  [us]	
  
XXX,	
  Inc.	
  shall	
  have	
  a	
  royalty-­‐free,	
  worldwide,	
  transferable,	
  and	
  
perpetual	
  license	
  to	
  use	
  or	
  incorporate	
  into	
  the	
  Service	
  any	
  
suggesFons,	
  ideas,	
  enhancement	
  requests,	
  feedback,	
  or	
  other	
  
informaFon	
  provided	
  by	
  you	
  or	
  any	
  Authorized	
  User	
  relaFng	
  to	
  the	
  
Service.	
  
Acceptable Technical Debt
•  It must be worth the …
•  Extra complexity when debugging
•  Time to explain to others
•  Production reliability risk
•  Security risk
•  There is no such thing as a free trial
Known Overhead
•  Overhead must be known to be managed
–  T&Cs should not prohibit its measurement or publication
•  Sources of overhead:
–  CPU cycles
–  File system I/O
–  Network I/O
–  Installed software size
•  We will measure it
Low Overhead
•  Overhead should also be the lowest possible
–  1% CPU overhead means 1% more instances, and $$$
•  Things we try to avoid
–  Tracing every function/method call
–  Needless kernel/user data transfers
–  strace (ptrace), tcpdump, libpcap, …
•  Event logging doesn't scale

Recommended for you

SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs

Talk from SREcon2016 by Brendan Gregg. Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg . "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible. In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."

sre performance
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement

VampirTrace provides instrumentation and run-time measurement capabilities. It allows for automatic, manual, and binary instrumentation. Run-time measurement includes collecting trace data behind the scenes and post-processing. Users have options to configure various settings like environment variables, hardware performance counters, memory allocation counters, filtering, and grouping. FAQ and troubleshooting information is also available.

Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...

ScicomP 2015 presentation discussing best practices for debugging ‪CUDA‬ and ‪‎OpenACC‬ applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.

Scalable
•  Can the product scale to (say) 100,000 instances?
–  Atlas, our cloud-wide analysis tool, can
–  We tend to kill other monitoring tools that attempt this
•  Real-time dashboards showing all instances:
–  How does that work? Can it scale to 1k? … 100k?
–  Adrian Cockcroft's spigo can simulate protocols at scale
•  High overhead might be worth it: on-demand only
Useful
An instance analysis solution must provide
actionable information
that helps us improve performance
4. Methodologies
Methodologies
Methodologies pose the questions
for metrics to answer
Good monitoring/analysis tools should support
performance analysis methodologies

Recommended for you

ch11.ppt
ch11.pptch11.ppt
ch11.ppt

This chapter discusses virtual machines, network forensics, and live acquisitions. It covers detecting virtual machines on hosts, imaging virtual disks, and using virtual machines to examine malware. Network forensics topics include securing networks with layered defenses, performing live acquisitions, developing standard procedures, and reviewing logs. Common network tools are also outlined, such as Sysinternals, BackTrack, packet sniffers like Wireshark, and examining the Honeynet Project for information on attacks.

Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets

Talk for USENIX/LISA2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high to low level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.

RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb

The recorded version of 'Best Of The World Webcast Series' [Webinar] where Jacob Holcomb speaks on 'RIoT (Raiding Internet of Things)' is available on CISOPlatform. Best Of The World Webcast Series are webinars where breakthrough/original security researchers showcase their study, to offer the CISO/security experts the best insights in information security. For more signup(it's free): www.cisoplatform.com

webinarcisoplatforminternetofthings
Drunk Man Anti-Method
•  Tune things at random until the problem goes away
Workload Characterization
Study the workload applied:
1.  Who
2.  Why
3.  What
4.  How
Target	
  Workload	
  
Workload Characterization
Eg, for CPUs:
1.  Who: which PIDs, programs, users
2.  Why: code paths, context
3.  What: CPU instructions, cycles
4.  How: changing over time
Target	
  Workload	
  
CPUs
Who
How What
Why

Recommended for you

Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk

Nathan Cerny and Andrew Dufour's talk on monitoring and tuning of your Chef server from ChefConf 2016.

Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix

This talk discusses Linux profiling using perf_events (also called "perf") based on Netflix's use of it. It covers how to use perf to get CPU profiling working and overcome common issues. The speaker will give a tour of perf_events features and show how Netflix uses it to analyze performance across their massive Amazon EC2 Linux cloud. They rely on tools like perf for customer satisfaction, cost optimization, and developing open source tools like NetflixOSS. Key aspects covered include why profiling is needed, a crash course on perf, CPU profiling workflows, and common "gotchas" to address like missing stacks, symbols, or profiling certain languages and events.

Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools

Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk . Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg. There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes. This tutorial is updated and extended on an earlier talk that summarizes the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation.

linux performance tools tracing
CPUs
Who
How What
Why
top,	
  htop!
perf record -g!
flame	
  graphs	
  
monitoring	
   perf stat -a -d!
Most Monitoring Products Today
Who
How What
Why
top,	
  htop!
perf record -g!
flame	
  Graphs	
  
monitoring	
   perf stat -a -d!
The USE Method
•  For every resource, check:
1.  Utilization
2.  Saturation
3.  Errors
•  Saturation is queue length or queued time
•  Start by drawing a functional (block) diagram of your
system / software / environment
Resource	
  
UFlizaFon	
  
(%)	
  X	
  
USE Method for Hardware
Include busses & interconnects!

Recommended for you

Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud

My talk for BayLISA, Oct 2013, launching the Systems Performance book. Operating system performance analysis and tuning leads to a better end-user experience and lower costs, especially for cloud computing environments that pay by the operating system instance. This book covers concepts, strategy, tools and tuning for Unix operating systems, with a focus on Linux- and Solaris-based systems. The book covers the latest tools and techniques, including static and dynamic tracing, to get the most out of your systems.

performance
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018

1. The document describes the role of an AI engineer at a computer vision startup called Umbo. It discusses developing and maintaining computer vision services, building machine learning pipelines, and lessons learned. 2. Key responsibilities of an AI engineer include developing and maintaining computer vision services, building machine learning pipelines to improve services and measure model performance, and debugging and refactoring code. 3. The author learned that domain knowledge is important, Python is a unifying language, and collaboration between researchers and engineers is necessary. The future may see more high-quality backend development to support machine learning services.

aibackendpython
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method

Delivered at the FISL13 conference in Brazil: http://www.youtube.com/watch?v=K9w2cipqfvc This talk introduces the USE Method: a simple strategy for performing a complete check of system performance health, identifying common bottlenecks and errors. This methodology can be used early in a performance investigation to quickly identify the most severe system performance issues, and is a methodology the speaker has used successfully for years in both enterprise and cloud computing environments. Checklists have been developed to show how the USE Method can be applied to Solaris/illumos-based and Linux-based systems. Many hardware and software resource types have been commonly overlooked, including memory and I/O busses, CPU interconnects, and kernel locks. Any of these can become a system bottleneck. The USE Method provides a way to find and identify these. This approach focuses on the questions to ask of the system, before reaching for the tools. Tools that are ultimately used include all the standard performance tools (vmstat, iostat, top), and more advanced tools, including dynamic tracing (DTrace), and hardware performance counters. Other performance methodologies are included for comparison: the Problem Statement Method, Workload Characterization Method, and Drill-Down Analysis Method.

methodologiesperformancemonitoring
hXp://www.brendangregg.com/USEmethod/use-­‐linux.html	
  
Most Monitoring Products Today
•  Showing what is and is not commonly measured
•  Score: 8 out of 33 (24%)
•  We can do better… U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
  
U	
   S	
   E	
   U	
   S	
   E	
   U	
   S	
   E	
   U	
   S	
   E	
  
Other Methodologies
•  There are many more:
–  Drill-Down Analysis Method
–  Time Division Method
–  Stack Profile Method
–  Off-CPU Analysis
–  …
–  I've covered these in previous talks & books
5. Our Tools
Atlas

Recommended for you

Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace

The document provides an overview of performance analysis tools like tracing and profiling. It discusses different tracing approaches like print statements, logging frameworks, and debuggers. It introduces DTrace as a dynamic instrumentation tool that allows tracing production systems with zero probe effect. A case study demonstrates using DTrace to analyze NFS latency issues. The document also discusses tracing tools for Linux like ftrace, perf, SystemTap, and eBPF.

performancedtracetracing
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGL

Guider is an integrated runtime performance analyzer for Linux that collects system resource and task data in real-time. It traces numerous system operations and visualizes complex performance data. Guider provides highly readable reports and debugging features to help optimize performance. It is open source, system-wide, easy to use, accurate, and light on system resources. Guider can monitor and collect system stats, trace threads, functions, and system calls, and control tasks for testing. Future work includes real-time user-level function tracing and a GUI client for remote control and visualization.

guiderperformancelinux
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1

The first set of slides from the second day of the Thingsquare IoT workshop: http://thingsquare.com/training/

contikicontiki 3.xinternet of things
BaseAMI
•  Many sources for instance metrics & analysis
–  Atlas, Vector, sar, perf-tools (ftrace, perf_events), …
•  Currently not using 3rd party monitoring vendor tools
Linux	
  (usually	
  Ubuntu)	
  
Java	
  (JDK	
  7	
  or	
  8)	
  
Tomcat	
  GC	
  and	
  
thread	
  
dump	
  
logging	
  
hystrix,	
  metrics	
  (Servo),	
  
health	
  check	
  
OpFonal	
  Apache,	
  
memcached,	
  Node.js,	
  
…	
  
Atlas,	
  S3	
  log	
  rotaFon,	
  
sar,	
  erace,	
  perf,	
  stap,	
  
perf-­‐tools	
  
Vector,	
  pcp	
  
ApplicaFon	
  war	
  files,	
  
plahorm,	
  base	
  servelet	
  
Netflix Atlas
Netflix Atlas
Select	
  Instance	
  
Historical	
  Metrics	
  
Select	
  Metrics	
  
Netflix Vector

Recommended for you

Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session

This document discusses Splunk's data onboarding process, which provides a systematic way to ingest new data sources into Splunk. It ensures new data is instantly usable and valuable. The process involves several steps: pre-boarding to identify the data and required configurations; building index-time configurations; creating search-time configurations like extractions and lookups; developing data models; testing; and deploying the new data source. Following this process helps get new data onboarding right the first time and makes the data immediately useful.

sf2015
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance

This document provides a performance engineer's predictions for computing performance trends in 2021 and beyond. The engineer discusses trends in processors, memory, disks, networking, runtimes, kernels, hypervisors, and observability. For processors, predictions include multi-socket systems becoming less common, the future of simultaneous multithreading being unclear, practical core count limits being reached in the 2030s, and more processor vendors including ARM-based and RISC-V options. Memory predictions focus on many workloads being memory-bound currently.

performancecomputingcloud computing
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking

The document discusses challenges with processor benchmarking and provides recommendations. It summarizes a case study where a popular CPU benchmark claimed a new processor was 2.6x faster than Intel, but detailed analysis found the benchmark was testing division speed, which accounted for only 0.1% of cycles on Netflix servers. The document advocates for low-level, active benchmarking and profiling over statistical analysis. It also provides a checklist for evaluating benchmarks and cautions that increased processor complexity and cloud environments make accurate benchmarking more difficult.

processorscpusbenchmarking
Netflix Vector
Near	
  real-­‐7me,	
  
per-­‐second	
  metrics	
  
Flame	
  Graphs	
  
Select	
  
Metrics	
  
Select	
  Instance	
  
Java CPU Flame Graphs
Needs -XX:+PreserveFramePointer
and perf-map-agent
Java CPU Flame Graphs
Java	
   JVM	
  
Kernel	
  
sar
•  System Activity Reporter. Archive of metrics, eg:
•  Metrics are also in Atlas and Vector
•  Linux sar is well designed: units, groups
$ sar -n DEV!
Linux 3.13.0-49-generic (prod0141) !06/06/2015!_x86_64_ !(16 CPU)!
!
12:00:01 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil!
12:05:01 AM eth0 4824.26 3941.37 919.57 15706.14 0.00 0.00 0.00 0.00!
12:05:01 AM lo 23913.29 23913.29 17677.23 17677.23 0.00 0.00 0.00 0.00!
12:15:01 AM eth0 4507.22 3749.46 909.03 12481.74 0.00 0.00 0.00 0.00!
12:15:01 AM lo 23456.94 23456.94 14424.28 14424.28 0.00 0.00 0.00 0.00!
12:25:01 AM eth0 10372.37 9990.59 1219.22 27788.19 0.00 0.00 0.00 0.00!
12:25:01 AM lo 25725.15 25725.15 29372.20 29372.20 0.00 0.00 0.00 0.00!
12:35:01 AM eth0 4729.53 3899.14 914.74 12773.97 0.00 0.00 0.00 0.00!
12:35:01 AM lo 23943.61 23943.61 14740.62 14740.62 0.00 0.00 0.00 0.00!
[…]!

Recommended for you

Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)

This document provides an overview of using eBPF (extended Berkeley Packet Filter) to quickly get performance wins as a sysadmin. It recommends installing BCC and bpftrace tools to easily find issues like periodic processes, misconfigurations, unexpected TCP sessions, or slow file system I/O. A case study examines using biosnoop to identify which processes were causing disk latency issues. The document suggests thinking like a sysadmin first by running tools, then like a programmer if a problem requires new tools. It also outlines recommended frontends depending on use cases and provides references to learn more about BPF.

bpfebpfperformance
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started

Talk for Facebook Systems@Scale 2021 by Brendan Gregg: "BPF (eBPF) tracing is the superpower that can analyze everything, helping you find performance wins, troubleshoot software, and more. But with many different front-ends and languages, and years of evolution, finding the right starting point can be hard. This talk will make it easy, showing how to install and run selected BPF tools in the bcc and bpftrace open source projects for some quick wins. Think like a sysadmin, not like a programmer."

linuxbpfperformance
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)

Talk by Brendan Gregg for USENIX LISA 2021. https://www.youtube.com/watch?v=5nN1wjA_S30 . "The future of computer performance involves clouds with hardware hypervisors and custom processors, servers running a new type of BPF software to allow high-speed applications and kernel customizations, observability of everything in production, new Linux kernel technologies, and more. This talk covers interesting developments in systems and computing performance, their challenges, and where things are headed."

performance
sar Observability
perf-tools
•  Some front-ends to Linux ftrace & perf_events
–  Advanced, custom kernel observability when needed (rare)
–  https://github.com/brendangregg/perf-tools
–  Unsupported hacks: see WARNINGs
•  ftrace
–  First added to Linux 2.6.27
–  A collection of capabilities, used via /sys/kernel/debug/tracing/
•  perf_events
–  First added to Linux 2.6.31
–  Tracer/profiler multi-tool, used via "perf" command
perf-tools: funccount
•  Eg, count a kernel function call rate:
•  Other perf-tools can then instrument these in more detail
# ./funccount -i 1 'bio_*'!
Tracing "bio_*"... Ctrl-C to end.!
!
FUNC COUNT!
bio_attempt_back_merge 26!
bio_get_nr_vecs 361!
bio_alloc 536!
bio_alloc_bioset 536!
bio_endio 536!
bio_free 536!
bio_fs_destructor 536!
bio_init 536!
bio_integrity_enabled 536!
bio_put 729!
bio_add_page 1004!
!
[...]!
Counts	
  are	
  in-­‐kernel,	
  
for	
  low	
  overhead	
  
perf-tools (so far…)

Recommended for you

BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)

USENIX LISA2021 talk by Brendan Gregg (https://www.youtube.com/watch?v=_5Z2AU7QTH4). This talk is a deep dive that describes how BPF (eBPF) works internally on Linux, and dissects some modern performance observability tools. Details covered include the kernel BPF implementation: the verifier, JIT compilation, and the BPF execution environment; the BPF instruction set; different event sources; and how BPF is used by user space, using bpftrace programs as an example. This includes showing how bpftrace is compiled to LLVM IR and then BPF bytecode, and how per-event data and aggregated map data are fetched from the kernel.

bpfebpflinux
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started

Keynote by Brendan Gregg for the eBPF summit, 2020. How to get started finding performance wins using the BPF (eBPF) technology. This short talk covers the quickest and easiest way to find performance wins using BPF observability tools on Linux.

bpfebpflinux
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance

Talk for YOW! by Brendan Gregg. "Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud. "

linuxperformance
eBPF
•  Currently being integrated. Efficient (JIT) in-kernel maps.
•  Measure latency, heat maps, …
eBPF
eBPF will make a profound difference to
monitoring on Linux systems
There will be an arms race to support it, post Linux 4.1+
If it's not on your roadmap, it should be
Summary
Requirements
•  Acceptable T&Cs
•  Acceptable technical debt
•  Known overhead
•  Low overhead
•  Scalable
•  Useful

Recommended for you

re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix

This document provides an overview of Brendan Gregg's presentation on BPF performance analysis at Netflix. It discusses: - Why BPF is changing the Linux OS model to become more event-based and microkernel-like. - The internals of BPF including its origins, instruction set, execution model, and how it is integrated into the Linux kernel. - How BPF enables a new class of custom, efficient, and safe performance analysis tools for analyzing various Linux subsystems like CPUs, memory, disks, networking, applications, and the kernel. - Examples of specific BPF-based performance analysis tools developed by Netflix, AWS, and others for analyzing tasks, scheduling, page faults

bpflinuxperformance
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software

BPF (Berkeley Packet Filter) has evolved from a limited virtual machine for efficient packet filtering to a new type of software called extended BPF. Extended BPF allows for custom, efficient, and production-safe performance analysis tools and observability programs to be run in the Linux kernel through BPF. It enables new event-based applications running as BPF programs attached to various kernel events like kprobes, uprobes, tracepoints, sockets, and more. Major companies like Facebook, Google, and Netflix are using BPF programs for tasks like intrusion detection, container security, firewalling, and observability with over 150,000 AWS instances running BPF programs. BPF provides a new program model and security features compared

performancelinuxbpf
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance

Talk by Brendan Gregg for USENIX LISA 2019: Linux Systems Performance. Abstract: " Systems performance is an effective discipline for performance analysis and tuning, and can help you find performance wins for your applications and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas of Linux systems performance: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (Ftrace, bcc/BPF, and bpftrace/BPF), and much advice about what is and isn't important to learn. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud."

linuxperformanceperformance tuning
Methodologies
Support for:
•  Workload Characterization
•  The USE Method
•  …
Not starting with metrics in search of uses
Desirables
Instrument These
With full eBPF support
Linux has awesome instrumentation: use it!
Links & References
•  Netflix Vector
–  https://github.com/netflix/vector
–  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
•  Netflix Atlas
–  http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
•  Heat Maps
–  http://www.brendangregg.com/heatmaps.html
–  http://www.brendangregg.com/HeatMaps/latency.html
•  Flame Graphs
–  http://www.brendangregg.com/flamegraphs.html
–  http://techblog.netflix.com/2014/11/nodejs-in-flames.html
•  Frequency Trails: http://www.brendangregg.com/frequencytrails.html
•  Methodology
–  http://www.brendangregg.com/methodology.html
–  http://www.brendangregg.com/USEmethod/use-linux.html
•  perf-tools: https://github.com/brendangregg/perf-tools
•  eBPF: http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
•  Images:
–  horse: Microsoft Powerpoint clip art
–  gauge: https://github.com/thlorenz/d3-gauge
–  eBPF ponycorn: Deirdré Straughan & General Zoi's Pony Creator

Recommended for you

LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools

This document discusses Brendan Gregg's opinions on various tracing tools including sysdig, perf, ftrace, eBPF, bpftrace, and BPF perf tools. It provides a table comparing the scope, capability, and ease of use of these tools. It then gives an example of using BPF perf tools to analyze readahead performance. Finally, it outlines desired additions to tracing capabilities and BPF helpers as well as challenges in areas like function tracing without frame pointers.

bpflinux
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability

Here is a bpftrace program to measure scheduler latency for ICMP echo requests: #!/usr/local/bin/bpftrace kprobe:icmp_send { @start[tid] = nsecs; } kprobe:__netif_receive_skb_core { @diff[tid] = hist(nsecs - @start[tid]); delete(@start[tid]); } END { print(@diff); clear(@diff); } This traces the time between the icmp_send kernel function (when the packet is queued for transmit) and the __netif_receive_skb_core function (when the response packet is received). The

bpfebpflinu
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix

This document summarizes Brendan Gregg's experiences working at Netflix for over 4.5 years. Some key points include: - The company culture at Netflix is openly documented and encourages independent decision making, open communication, and sharing information broadly. - Gregg's first meeting involved an expected "intense debate" but was actually professional and respectful. - Netflix values judgment, communication, curiosity, courage, and other traits that allow the culture and architecture to complement each other. - The cloud architecture is designed to be resilient through practices like chaos engineering and rapid deployments without approvals, in line with the culture of freedom and responsibility.

culture
Thanks
•  Questions?
•  http://techblog.netflix.com
•  http://slideshare.net/brendangregg
•  http://www.brendangregg.com
•  bgregg@netflix.com
•  @brendangregg
Jun	
  2015	
  

More Related Content

What's hot

The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
P. Taylor Goetz
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
Xin Wang
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Andrew Yongjoon Kong
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
Spark Summit
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Spark Summit
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
Robert Evans
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
Ruslan Meshenberg
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
SignalFx
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Shelan Perera
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm
毅 吕
 
Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)
Igalia
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Spark Summit
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
DataWorks Summit/Hadoop Summit
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersApril 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
Yahoo Developer Network
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
Spark Summit
 

What's hot (20)

The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
 
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using OpenstackCloud: From Unmanned Data Center to Algorithmic Economy using Openstack
Cloud: From Unmanned Data Center to Algorithmic Economy using Openstack
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Arc305 how netflix leverages multiple regions to increase availability an i...
Arc305 how netflix leverages multiple regions to increase availability   an i...Arc305 how netflix leverages multiple regions to increase availability   an i...
Arc305 how netflix leverages multiple regions to increase availability an i...
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
Analysis big data by use php with storm
Analysis big data by use php with stormAnalysis big data by use php with storm
Analysis big data by use php with storm
 
Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
 
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark ClustersApril 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 

Similar to Monitorama 2015 Netflix Instance Analysis

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
Brendan Gregg
 
HPC Application Profiling & Analysis
HPC Application Profiling & AnalysisHPC Application Profiling & Analysis
HPC Application Profiling & Analysis
Rishi Pathak
 
HPC Application Profiling and Analysis
HPC Application Profiling and AnalysisHPC Application Profiling and Analysis
HPC Application Profiling and Analysis
Rishi Pathak
 
Introduction to Malware Analysis
Introduction to Malware AnalysisIntroduction to Malware Analysis
Introduction to Malware Analysis
Andrew McNicol
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
Brendan Gregg
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement
PTIHPA
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
Rogue Wave Software
 
ch11.ppt
ch11.pptch11.ppt
ch11.ppt
contactatkmdp
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
Brendan Gregg
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb
Priyanka Aash
 
Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk
Andrew DuFour
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
Brendan Gregg
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
Brendan Gregg
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
Brendan Gregg
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018
Chun-Yu Tseng
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
Brendan Gregg
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
Graeme Jenkinson
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGL
Peace Lee
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Adam Dunkels
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
Splunk
 

Similar to Monitorama 2015 Netflix Instance Analysis (20)

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
HPC Application Profiling & Analysis
HPC Application Profiling & AnalysisHPC Application Profiling & Analysis
HPC Application Profiling & Analysis
 
HPC Application Profiling and Analysis
HPC Application Profiling and AnalysisHPC Application Profiling and Analysis
HPC Application Profiling and Analysis
 
Introduction to Malware Analysis
Introduction to Malware AnalysisIntroduction to Malware Analysis
Introduction to Malware Analysis
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
ch11.ppt
ch11.pptch11.ppt
ch11.ppt
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
RIoT (Raiding Internet of Things) by Jacob Holcomb
RIoT  (Raiding Internet of Things)  by Jacob HolcombRIoT  (Raiding Internet of Things)  by Jacob Holcomb
RIoT (Raiding Internet of Things) by Jacob Holcomb
 
Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk Monitoring and tuning your chef server - chef conf talk
Monitoring and tuning your chef server - chef conf talk
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Systems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the CloudSystems Performance: Enterprise and the Cloud
Systems Performance: Enterprise and the Cloud
 
from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018from ai.backend import python @ pycontw2018
from ai.backend import python @ pycontw2018
 
Performance Analysis: The USE Method
Performance Analysis: The USE MethodPerformance Analysis: The USE Method
Performance Analysis: The USE Method
 
Performance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTracePerformance analysis and troubleshooting using DTrace
Performance analysis and troubleshooting using DTrace
 
Guider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGLGuider: An Integrated Runtime Performance Analyzer on AGL
Guider: An Integrated Runtime Performance Analyzer on AGL
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
 
Data Onboarding Breakout Session
Data Onboarding Breakout SessionData Onboarding Breakout Session
Data Onboarding Breakout Session
 

More from Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
Brendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
Brendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
Brendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
Brendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
Brendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
Brendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
Brendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
Brendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
Brendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
Brendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
Brendan Gregg
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
Brendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
Brendan Gregg
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
Brendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
Brendan Gregg
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
Brendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
Brendan Gregg
 

More from Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 

Recently uploaded

7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
Enterprise Wired
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Chris Swan
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
ScyllaDB
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
Linda Zhang
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
The Digital Insurer
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
amitchopra0215
 
Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1
FellyciaHikmahwarani
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
BookNet Canada
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
The Digital Insurer
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 

Recently uploaded (20)

7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
 
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsMYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
MYIR Product Brochure - A Global Provider of Embedded SOMs & Solutions
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
 
Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1
 
Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024Details of description part II: Describing images in practice - Tech Forum 2024
Details of description part II: Describing images in practice - Tech Forum 2024
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 

Monitorama 2015 Netflix Instance Analysis

  • 1. Netflix Instance Performance Analysis Requirements Brendan Gregg Senior Performance Architect Performance Engineering Team bgregg@netflix.com @brendangregg Jun  2015  
  • 2. Monitoring companies are selling faster horses I want to buy a car
  • 3. Server/Instance Analysis Potential In the last 10 years… •  More Linux •  More Linux metrics •  Better visualizations •  Containers Conditions ripe for innovation: where is our Henry Ford?
  • 4. This Talk •  Instance analysis: system resources, kernel, processes –  For customers: what you can ask for –  For vendors: our desirables & requirements –  What we are building (and open sourcing) at Netflix to modernize instance performance analysis (Vector, …)
  • 5. •  Over 60M subscribers •  FreeBSD CDN for content delivery •  Massive AWS EC2 Linux cloud •  Many monitoring/analysis tools •  Awesome place to work
  • 6. Agenda 1.  Desirables 2.  Undesirables 3.  Requirements 4.  Methodologies 5.  Our Tools
  • 11. Histograms …  or  a  density  plot  
  • 20. … Without Running All These
  • 22. Other Desirables •  Safe for production use •  Easy to use: self service •  [Near] Real Time •  Ad hoc / custom instrumentation •  Complete documentation •  Graph labels and units •  Open source •  Community
  • 25. Pie Charts …for real-time metrics usr   sys   wait   idle  
  • 26. Doughnuts usr   sys   wait   idle   …like pie charts but worse
  • 27. Traffic Lights …when used for subjective metrics These can be used for objective metrics For subjective metrics (eg, IOPS/latency) try weather icons instead RED == BAD (usually) GREEN == GOOD (hopefully)
  • 29. Acceptable T&Cs •  Probably acceptable: •  Probably not acceptable: •  Check with your legal team By  submi9ng  any  Ideas,  Customer  and  Authorized  Users  agree   that:  ...  (iii)  all  right,  Ftle  and  interest  in  and  to  the  Ideas,  including  all   associated  IP  Rights,  shall  be,  and  hereby  are,  assigned  to  [us]   XXX,  Inc.  shall  have  a  royalty-­‐free,  worldwide,  transferable,  and   perpetual  license  to  use  or  incorporate  into  the  Service  any   suggesFons,  ideas,  enhancement  requests,  feedback,  or  other   informaFon  provided  by  you  or  any  Authorized  User  relaFng  to  the   Service.  
  • 30. Acceptable Technical Debt •  It must be worth the … •  Extra complexity when debugging •  Time to explain to others •  Production reliability risk •  Security risk •  There is no such thing as a free trial
  • 31. Known Overhead •  Overhead must be known to be managed –  T&Cs should not prohibit its measurement or publication •  Sources of overhead: –  CPU cycles –  File system I/O –  Network I/O –  Installed software size •  We will measure it
  • 32. Low Overhead •  Overhead should also be the lowest possible –  1% CPU overhead means 1% more instances, and $$$ •  Things we try to avoid –  Tracing every function/method call –  Needless kernel/user data transfers –  strace (ptrace), tcpdump, libpcap, … •  Event logging doesn't scale
  • 33. Scalable •  Can the product scale to (say) 100,000 instances? –  Atlas, our cloud-wide analysis tool, can –  We tend to kill other monitoring tools that attempt this •  Real-time dashboards showing all instances: –  How does that work? Can it scale to 1k? … 100k? –  Adrian Cockcroft's spigo can simulate protocols at scale •  High overhead might be worth it: on-demand only
  • 34. Useful An instance analysis solution must provide actionable information that helps us improve performance
  • 36. Methodologies Methodologies pose the questions for metrics to answer Good monitoring/analysis tools should support performance analysis methodologies
  • 37. Drunk Man Anti-Method •  Tune things at random until the problem goes away
  • 38. Workload Characterization Study the workload applied: 1.  Who 2.  Why 3.  What 4.  How Target  Workload  
  • 39. Workload Characterization Eg, for CPUs: 1.  Who: which PIDs, programs, users 2.  Why: code paths, context 3.  What: CPU instructions, cycles 4.  How: changing over time Target  Workload  
  • 41. CPUs Who How What Why top,  htop! perf record -g! flame  graphs   monitoring   perf stat -a -d!
  • 42. Most Monitoring Products Today Who How What Why top,  htop! perf record -g! flame  Graphs   monitoring   perf stat -a -d!
  • 43. The USE Method •  For every resource, check: 1.  Utilization 2.  Saturation 3.  Errors •  Saturation is queue length or queued time •  Start by drawing a functional (block) diagram of your system / software / environment Resource   UFlizaFon   (%)  X  
  • 44. USE Method for Hardware Include busses & interconnects!
  • 46. Most Monitoring Products Today •  Showing what is and is not commonly measured •  Score: 8 out of 33 (24%) •  We can do better… U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E   U   S   E  
  • 47. Other Methodologies •  There are many more: –  Drill-Down Analysis Method –  Time Division Method –  Stack Profile Method –  Off-CPU Analysis –  … –  I've covered these in previous talks & books
  • 49. BaseAMI •  Many sources for instance metrics & analysis –  Atlas, Vector, sar, perf-tools (ftrace, perf_events), … •  Currently not using 3rd party monitoring vendor tools Linux  (usually  Ubuntu)   Java  (JDK  7  or  8)   Tomcat  GC  and   thread   dump   logging   hystrix,  metrics  (Servo),   health  check   OpFonal  Apache,   memcached,  Node.js,   …   Atlas,  S3  log  rotaFon,   sar,  erace,  perf,  stap,   perf-­‐tools   Vector,  pcp   ApplicaFon  war  files,   plahorm,  base  servelet  
  • 51. Netflix Atlas Select  Instance   Historical  Metrics   Select  Metrics  
  • 53. Netflix Vector Near  real-­‐7me,   per-­‐second  metrics   Flame  Graphs   Select   Metrics   Select  Instance  
  • 54. Java CPU Flame Graphs
  • 55. Needs -XX:+PreserveFramePointer and perf-map-agent Java CPU Flame Graphs Java   JVM   Kernel  
  • 56. sar •  System Activity Reporter. Archive of metrics, eg: •  Metrics are also in Atlas and Vector •  Linux sar is well designed: units, groups $ sar -n DEV! Linux 3.13.0-49-generic (prod0141) !06/06/2015!_x86_64_ !(16 CPU)! ! 12:00:01 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil! 12:05:01 AM eth0 4824.26 3941.37 919.57 15706.14 0.00 0.00 0.00 0.00! 12:05:01 AM lo 23913.29 23913.29 17677.23 17677.23 0.00 0.00 0.00 0.00! 12:15:01 AM eth0 4507.22 3749.46 909.03 12481.74 0.00 0.00 0.00 0.00! 12:15:01 AM lo 23456.94 23456.94 14424.28 14424.28 0.00 0.00 0.00 0.00! 12:25:01 AM eth0 10372.37 9990.59 1219.22 27788.19 0.00 0.00 0.00 0.00! 12:25:01 AM lo 25725.15 25725.15 29372.20 29372.20 0.00 0.00 0.00 0.00! 12:35:01 AM eth0 4729.53 3899.14 914.74 12773.97 0.00 0.00 0.00 0.00! 12:35:01 AM lo 23943.61 23943.61 14740.62 14740.62 0.00 0.00 0.00 0.00! […]!
  • 58. perf-tools •  Some front-ends to Linux ftrace & perf_events –  Advanced, custom kernel observability when needed (rare) –  https://github.com/brendangregg/perf-tools –  Unsupported hacks: see WARNINGs •  ftrace –  First added to Linux 2.6.27 –  A collection of capabilities, used via /sys/kernel/debug/tracing/ •  perf_events –  First added to Linux 2.6.31 –  Tracer/profiler multi-tool, used via "perf" command
  • 59. perf-tools: funccount •  Eg, count a kernel function call rate: •  Other perf-tools can then instrument these in more detail # ./funccount -i 1 'bio_*'! Tracing "bio_*"... Ctrl-C to end.! ! FUNC COUNT! bio_attempt_back_merge 26! bio_get_nr_vecs 361! bio_alloc 536! bio_alloc_bioset 536! bio_endio 536! bio_free 536! bio_fs_destructor 536! bio_init 536! bio_integrity_enabled 536! bio_put 729! bio_add_page 1004! ! [...]! Counts  are  in-­‐kernel,   for  low  overhead  
  • 61. eBPF •  Currently being integrated. Efficient (JIT) in-kernel maps. •  Measure latency, heat maps, …
  • 62. eBPF eBPF will make a profound difference to monitoring on Linux systems There will be an arms race to support it, post Linux 4.1+ If it's not on your roadmap, it should be
  • 64. Requirements •  Acceptable T&Cs •  Acceptable technical debt •  Known overhead •  Low overhead •  Scalable •  Useful
  • 65. Methodologies Support for: •  Workload Characterization •  The USE Method •  … Not starting with metrics in search of uses
  • 67. Instrument These With full eBPF support Linux has awesome instrumentation: use it!
  • 68. Links & References •  Netflix Vector –  https://github.com/netflix/vector –  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html •  Netflix Atlas –  http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html •  Heat Maps –  http://www.brendangregg.com/heatmaps.html –  http://www.brendangregg.com/HeatMaps/latency.html •  Flame Graphs –  http://www.brendangregg.com/flamegraphs.html –  http://techblog.netflix.com/2014/11/nodejs-in-flames.html •  Frequency Trails: http://www.brendangregg.com/frequencytrails.html •  Methodology –  http://www.brendangregg.com/methodology.html –  http://www.brendangregg.com/USEmethod/use-linux.html •  perf-tools: https://github.com/brendangregg/perf-tools •  eBPF: http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html •  Images: –  horse: Microsoft Powerpoint clip art –  gauge: https://github.com/thlorenz/d3-gauge –  eBPF ponycorn: Deirdré Straughan & General Zoi's Pony Creator
  • 69. Thanks •  Questions? •  http://techblog.netflix.com •  http://slideshare.net/brendangregg •  http://www.brendangregg.com •  bgregg@netflix.com •  @brendangregg Jun  2015