5. With more than 30 million streaming members in
the United States, Canada, Latin America, the
United Kingdom, Ireland and the Nordics, Netflix is
the world's leading internet subscription service for
enjoying movies and TV programs streamed over
the internet to PCs, Macs and TV.
Source: http://ir.netflix.com
Tweet @jedberg with feedback!
6. The Netflix Way
• Everything is “built for three”
• Fully automated build tools to test and
make packages
• Fully automated machine image bakery
• Fully automated image deployment
• Independent teams responsible for
both Dev and Ops
9. Automate all the things!
• Application startup
• Configuration
• Code deployment
• System deployment
10. Automation
• Standard base image
• Tools to manage all the systems
• Automated code deployment
11. Shared state should be
stored in a shared service
Data on an instance should
be replicated to other
instances
12. “Build for Three”
We hold a boot camp for new engineers to teach them how
to build for a highly distributed environment.
14. Netflix on AWS
[Diagram: Netflix on AWS in 2012; labels shown: IPv6, Open Connect]
15. Highly aligned, loosely coupled
• Services are built by different teams
who work together to figure out what
each service will provide.
• The service owner publishes an API
that anyone can use.
16. Advantages to a Service
Oriented Architecture
• Easier auto-scaling
• Easier capacity planning
• Identify problematic code-paths more easily
• Narrow in on the effects of a change
• More efficient local caching
17. Freedom and Responsibility
• Developers deploy when they want
• They also manage their own capacity
and autoscaling
• And fix anything that breaks at 4am!
18. All systems choices assume
some part will fail at some
point.
19. The Monkey Theory
• Simulate things
that go wrong
• Find things that
are different
20. Execution
Photo from I, Robot, copyright 20th Century Fox
21. Netflix built a global PaaS
• Service Oriented
Architecture
• HTTP/REST interfaces
between services
22. Netflix PaaS features
• Supports all regions and zones
• Multiple accounts
• Cross region/account replication
• Internationalized, localized and GeoIP routed
• Advanced key management
• Autoscaling with 1000s of instances
• Monitoring and alerting on millions of metrics
23. What AWS Provides
• Instances
• Machine Images
• Elastic IPs
• Load Balancers
• Security groups / Autoscaling groups
• Availability zones and regions
24. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• Tomcat
• Monitoring: log rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
25. The Netflix Platform
• Discovery (Eureka)
• Entrypoints (Edda)
• Circuit Breakers (Hystrix)
• Configuration (Archaius)
• Cassandra (Priam & Astyanax & CassJMeter)
• Zookeeper (Exhibitor)
• Cryptex
• Logging (Blitz4j & Honu)
• AKMS
• EVCache
• NIWS
• Proxies
• i18n
• Geo
• L10n
• Base
(Diagram legend: Open Source)
27. Open Source at Netflix
[Timeline figure with a monthly axis running from Nov through Oct, with 2012 marked, showing project releases. Legible project names: Curator, Astyanax, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Edda, Blitz4j, Hystrix, Governator, Eureka, Servo]
28. Finding things
• Discovery (Eureka)
  • Application to instance mapping
  • Heartbeat to keep track of health
• Entrypoints (Edda)
  • Local database of AWS resources
• NIWS (Netflix Internal Web Service)
  • On-instance software load balancer
  • Handles retry logic
• Geo (Geolocation library)
  • Provides IP to Lat/Lon mapping for any service that needs it.
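The discovery bullets (application to instance mapping, plus a heartbeat for health) can be illustrated with a toy registry. This is only a sketch of the idea and not the Eureka API: the `Registry` class, its method names and the 90-second expiry are all assumptions made for illustration.

```python
import time


class Registry:
    """Toy in-memory service registry (hypothetical; not the Eureka API)."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        # ttl_seconds is an assumed expiry window, not a documented default.
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # application name -> {instance id -> time of last heartbeat}
        self.apps = {}

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew it if it is already known."""
        self.apps.setdefault(app, {})[instance_id] = self.clock()

    def healthy_instances(self, app):
        """Return instances whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            instance_id
            for instance_id, seen in self.apps.get(app, {}).items()
            if now - seen <= self.ttl_seconds
        )
```

An instance that stops sending heartbeats simply ages out of `healthy_instances`, so callers never need an explicit "deregister" step for a dead machine.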
29. Entrypoints (Edda)
• REST API
  • GET /REST/v2/instance/$id
• Keeps track of all resources
  • Autoscaling groups, EIPs, Instances, Applications, Clusters, History
30. Entrypoints Exploration
• Find all active instances:
  GET /REST/v2/view/instances
• Find all instances in a cluster:
  GET /REST/v2/group/clusters
• Show only ASG name, instance ID and health:
  /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
• Which ASG contains a particular instance?
  /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
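The last two queries on this slide share a pattern: a collection path, an optional resource name, `;key=value` filters, and an optional `_pp:(...)` field selector. A small helper can assemble such paths. The function `edda_path` and its parameters are hypothetical names; it only reproduces the string shapes shown on the slide and does not claim to cover the real query syntax.

```python
def edda_path(collection, resource_id=None, matrix=None, pretty=False, fields=None):
    """Assemble a query path in the shape shown on the slide.

    All parameter names are illustrative assumptions:
    collection  -- e.g. "aws/autoScalingGroups"
    resource_id -- optional single resource, e.g. "edda-v123"
    matrix      -- optional dict of ";key=value" filters
    pretty      -- append ";_pp" as in the slide's third example
    fields      -- optional field selector placed inside ":( ... )"
    """
    path = "/v2/" + collection
    if resource_id:
        path += "/" + resource_id
    for key, value in (matrix or {}).items():
        path += ";{}={}".format(key, value)
    if pretty:
        path += ";_pp"
    if fields:
        path += ":(" + fields + ")"
    return path


# The two matrix-style queries from the slide, rebuilt from their parts.
asg_summary = edda_path(
    "aws/autoScalingGroups",
    resource_id="edda-v123",
    pretty=True,
    fields="autoScalingGroupName,instances:(instanceId,lifecycleState)",
)
asg_for_instance = edda_path(
    "aws/autoScalingGroups",
    matrix={"instances.instanceId": "i-96f3ca3a"},
)
```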
31. Keeping it all Straight
• Configuration (Archaius)
  • Global variables (Fast properties)
• Base
  • Base system. Prod vs. Test, etc.
• Zookeeper (Curator)
  • Locks, other similar coordination
• Logging (Blitz4j and Honu)
  • Keep track of what happened and store it for later analysis.
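The "fast properties" idea (configuration values that can change while the service is running) can be sketched as a property object that looks up its value on every read. This is a toy illustration and not the Archaius API: `DynamicProperty`, the plain dict used as a backing store and the `timeout_ms` key are all assumptions.

```python
class DynamicProperty:
    """Toy runtime-changeable property (hypothetical; not the Archaius API)."""

    def __init__(self, store, name, default):
        self.store = store      # shared dict standing in for a config service
        self.name = name
        self.default = default

    def get(self):
        """Look the value up on every call so updates take effect at once."""
        return self.store.get(self.name, self.default)


config_store = {}
timeout_ms = DynamicProperty(config_store, "timeout_ms", 500)
before = timeout_ms.get()            # falls back to the default
config_store["timeout_ms"] = 250     # simulated change pushed at runtime
after = timeout_ms.get()             # picks up the new value without a restart
```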
32. Keeping it Secure
• Cryptex
  • Service for key management
  • High, medium and low value keys
• AKMS (Amazon Key Management System)
  • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance
For more info, see SEC201: Security Panel
33. Storing it
• Cassandra (Priam, Astyanax)
  • Configure and access Cassandra
  • Provide OO abstractions and handle connection pooling and discovery of hosts
• EVCache (Eccentric Volatile Cache)
  • Wrapper for memcached to handle zone awareness and replication
• Proxies
  • Get data out of the datacenter and into the cloud.
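Zone awareness and replication in a cache wrapper can be sketched as "write to every zone, read from the local zone first". The class below is a toy model and not the EVCache API: `ZoneAwareCache` is a hypothetical name, plain dicts stand in for each zone's memcached servers, and the zone names in the usage are examples only.

```python
class ZoneAwareCache:
    """Toy stand-in for a zone-aware cache wrapper (not the EVCache API)."""

    def __init__(self, local_zone, zones):
        if local_zone not in zones:
            raise ValueError("local_zone must be one of zones")
        self.local_zone = local_zone
        # One dict per zone stands in for that zone's memcached servers.
        self.stores = {zone: {} for zone in zones}

    def set(self, key, value):
        """Write to every zone so each zone holds a full copy."""
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        """Read from the local zone first, then fall back to other zones."""
        ordered = [self.local_zone] + [
            zone for zone in self.stores if zone != self.local_zone
        ]
        for zone in ordered:
            if key in self.stores[zone]:
                return self.stores[zone][key]
        return None
```

Because every zone holds a copy, losing the local zone's cache turns a read into a slower cross-zone hit instead of a miss.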
34. Data
What do we do with it all?
35. We store it!
• Cache (memcached)
• Cassandra
• RDS (MySQL)
51. Netflix has moved the
granularity from the
instance to the cluster
52. Why Bake?
Traditional (Generic AMI to Instance):
• launch OS
• install packages
• install app
Netflix (App AMI to Instance):
• launch OS+app
53. Getting Baked
[Build pipeline diagram; labels shown:]
• Source: Perforce / Git
• Jenkins steps: sync, resolve, compile, build, test, publish, report
• Ant targets, Groovy all over
• Ivy and Artifactory: snapshot / release libraries, libraries / apps
• Output: app bundles
54. Base Image Baking
[Diagram; labels shown:]
• Foundation AMI (Linux: CentOS, Fedora, Ubuntu) in S3 / EBS
• Bakery on AWS EC2 slave instances: mount snapshot, install RPMs (Apache, Java...) from Yum / Apt
• Result: base AMI, ready for app bake
55. App Image Baking
[Diagram; labels shown:]
• Base AMI (Linux, Apache, Java, Tomcat) in S3 / EBS
• Bakery on AWS EC2 slave instances: mount snapshot, install app bundle from Jenkins / Yum / Artifactory
• Result: app AMI, ready to launch!
56. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• Tomcat
• Monitoring: log rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
57. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Java (JDK 6 or 7)
• Appdynamics App Agent monitoring
• JBoss
• Monitoring: log rotation to S3, Appdynamics Machine Agent
• GC and thread dump logging
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
58. Linux Base AMI (CentOS or Ubuntu)
• Optional Apache
• Python, monitoring
• Django
• Monitoring: log rotation to S3, Appdynamics Machine Agent
• Logging
• Application file, base server, platform, interface libs for dependent services
59. The Monkey Theory
• Simulate things
that go wrong
• Find things that
are different
60. The Simian Army
• Chaos -- Kills random instances
• Chaos Gorilla -- Kills zones
• Chaos Kong -- Kills regions
• Latency -- Degrades network and injects faults
• Conformity -- Looks for outliers
• Circus -- Kills and launches instances to maintain zone balance
• Doctor -- Fixes unhealthy resources
• Janitor -- Cleans up unused resources
• Howler -- Yells about bad things like Amazon limit violations
• Security -- Finds security issues and expiring certificates
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
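The first monkey on the list ("kills random instances") comes down to a random selection per group. The sketch below shows only that selection step and is not Chaos Monkey's actual code: `pick_victims`, the group names and the rule of skipping groups with a single instance are assumptions, and the actual termination call is left out.

```python
import random


def pick_victims(groups, rng=random):
    """Pick at most one random instance from each group that has spares.

    `groups` maps a group name to a list of instance ids. Groups with fewer
    than two instances are skipped so this sketch never removes the last one.
    """
    victims = {}
    for name, instances in groups.items():
        if len(instances) >= 2:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical groups; a seeded generator keeps the example repeatable.
example_groups = {"api": ["i-1", "i-2", "i-3"], "batch": ["i-9"]}
example_victims = pick_victims(example_groups, random.Random(0))
```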
65. Alert Systems
[Diagram; components shown: Atlas, Appdynamics, other teams' agents, CORE Agent, api, CORE Event Gateway, CORE alerting, Paging Service, Amazon SES]
68. Data Collection Pipeline
Data Processing Pipeline
For more info, see BDT303: Data Science with Elastic MapReduce
71. Incident Reviews
Ask the key questions:
• What went wrong?
• How could we have detected it sooner?
• How could we have prevented it?
• How can we prevent this class of
problem in the future?
• How can we improve our behavior for
next time?
72. Best Practices for Data
• Have multiple copies of all data
• Keep those copies in multiple AZs
• Avoid keeping state on a single instance
• Take frequent snapshots of EBS disks
• No secret keys on the instance
73. Netflix autoscaling
[Graph of a cluster autoscaling; annotations: 1. Traffic Peak, 2. Deployment]
74. AWS Usage
Dollar amounts have been carefully removed
78. Leveraging Multi-region
• 100% uptime is theoretically possible.
• You have to replicate your data
• This will cost money
79. Circuit Breakers (Hystrix)
Be liberal in what you accept, strict in what you send
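A circuit breaker stops calling a dependency after repeated failures and serves a fallback until a cool-down has passed. The class below is a minimal sketch of that pattern and not the Hystrix API: `CircuitBreaker`, its thresholds and the fallback convention are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical; not the Hystrix API)."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Run func(); on failure or while open, return fallback() instead."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()  # open: fail fast without calling func
            # Cool-down elapsed: allow one trial call; one failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

While the circuit is open the failing dependency receives no traffic at all, which gives it room to recover and keeps callers from stacking up on timeouts.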
80. Just a quick reminder...
• (Some of) Netflix is open source:
• https://github.com/netflix
81. We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.
Please fill out an evaluation
form when you have a
chance.
83. Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
Editor's Notes
My friends Joe and Carl already told you about Nac and our build system. This allows the devs to take control of their deployment. Each team is responsible for their own deployments and uptime. When something breaks, we have a system that lets us page a team who then gets on and fixes their stuff. Each team is responsible for their own destiny. So how do we stay reliable when we have no control? Information.
Automate as much as you can
The more automated things are, the easier it is to be a sysadmin: application startup, configuration, code deployment, full system deployment. The more automated things are, the easier it is to scale, especially in a virtualized environment with auto-scaling. And virtualized computing added the last bit, the ability to automate system deployment. (OK, that’s not entirely true, but watch me wave my hands and say it is.)
In most places, you have this. Standard image with tools to manage the systems and the deployment.
By building for three, you can reasonably lose one of your instances and still be stable.
replication factor quorum reads / writes
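The arithmetic behind "build for three" and quorum reads and writes can be written out directly. The function names below are illustrative; the formulas are the standard majority-quorum rules (a quorum is more than half of the replicas, and a read set overlaps every write set when R + W > N).

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be lost while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of three, a quorum is two replicas, so one instance can be lost and quorum reads and writes still succeed.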
In most systems, you worry about the software and installing it on an OS. At Netflix, the smallest thing we worry about is the instance image, which lives in a cluster. We’ve essentially built a platform for doing automated deployment of Java code (and some Python too!).
So, why do we bake custom images instead of just using Puppet or Chef to deploy packages dynamically to launched generic machines? We like to front-load the full machine assembly to build time, instead of waiting until deployment time. We do this because:
• More reliable: fewer systems that can fail at deploy time, right when we need them most.
• Faster launch: quicker reaction to load increases, e.g. autoscaling up can be more precise.
• Single image: produces exactly homogeneous clusters, with no file/package version skew across machines in a cluster.
OK, back to the build pipeline again. We have a vague “app bundles” output in this diagram. Let’s delve into how we manage the application bundle artifacts in more detail.
The first step of the baking process is to create the “base” image that we will use for baking all app images. This is done once every week or two. We start with a standard Linux distro as a foundation (CentOS now, Ubuntu on the way), and add in our favorite, our custom and customized packages: Apache, Java (JDK 6 and 7), Tomcat, Perl, Python, provisioning and startup scripts, log management tools, monitoring agents, etc. The end result is a beefed-up OS image that is ready to go, and just needs an app added.
The gateway classifies and routes events based on severity and the systems involved. It currently processes around 48K events a day.
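Routing on severity and on the system involved can be pictured as a small lookup. The sketch below is not the CORE gateway's real logic: the severity names, the channel names ("page", "email", "log") and the `owners` mapping are assumptions made only to show the shape of such a rule.

```python
def route_event(event, owners):
    """Return (channel, team) for an event dict with 'severity' and 'system'.

    Severity levels and channels here are hypothetical placeholders.
    """
    team = owners.get(event["system"], "unowned")
    if event["severity"] == "critical":
        channel = "page"    # wake someone up
    elif event["severity"] == "warning":
        channel = "email"   # notify, but do not page
    else:
        channel = "log"     # record only
    return channel, team


# Hypothetical ownership table and event.
example_owners = {"api": "edge-team", "cassandra": "data-team"}
example_route = route_event(
    {"severity": "critical", "system": "cassandra"}, example_owners
)
```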
(step through) Here are some best practices we’ve learned over the last year with EC2. Some of these we follow well, and some we need to follow better.
At Netflix we use autoscaling to help manage reliability and cost. Here is one of our clusters scaling up and down. We are tuning for the holidays, so you can see parts where we are doing squeeze tests and adjusting the scaling speed and values.
Amazon will help you as well. One way they do this is by providing zones. Each zone is like an island that is loosely connected to the other zones, but mostly distinct.
So how do you get better than 99.95% uptime? Multiple zones! By spreading your systems out across multiple zones, you should be able to withstand the failure of one zone. In a little bit, I’ll go over how reddit and Netflix used a multizone strategy to survive outages.
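A back-of-the-envelope calculation shows why multiple zones help. It rests on two loud assumptions that the talk does not state: that 99.95% is the availability of a single zone, and that zones fail independently (real zones are only "mostly distinct", so the true figure will be lower).

```python
def combined_availability(zone_availability, zones):
    """Chance that at least one of `zones` independent zones is up."""
    return 1.0 - (1.0 - zone_availability) ** zones


# Assumed per-zone availability of 99.95%, with independent failures.
one_zone = combined_availability(0.9995, 1)
two_zones = combined_availability(0.9995, 2)
three_zones = combined_availability(0.9995, 3)
```

Under those assumptions two zones are both down only 0.0005 × 0.0005 of the time, which is why the combined figure climbs so quickly with each added zone.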
Amazon, as well as other providers, offers multiple regions as well. Regions are essentially like separate providers with the same feature set. Your data does not get shared across regions.
You can contact me in one of these ways, or ask your question now. Thank you.