Devops at Netflix (re:Invent)

Rainmakers
How Netflix Operates Clouds for Maximum Freedom and Agility

Jeremy Edberg
Reliability Architect, Netflix

Do you have...

• A release Engineer?

• A QA department?

• Chef or Puppet to
manage your systems?

Tweet @jedberg with feedback!

Do you have...

• Upwards of 100 releases a day?


With more than 30 million streaming members in
the United States, Canada, Latin America, the
United Kingdom, Ireland and the Nordics, Netflix is
the world's leading internet subscription service for
enjoying movies and TV programs streamed over
the internet to PCs, Macs and TVs.
Source: http://ir.netflix.com


The Netflix Way
• Everything is “built for three”

• Fully automated build tools to test and
make packages

• Fully automated machine image bakery

• Fully automated image deployment

• Independent teams responsible for
both Dev and Ops


Philosophy


Automate all the things!

• Application startup

• Configuration

• Code deployment

• System deployment


Automation

• Standard base image

• Tools to manage all the systems

• Automated code deployment


Shared state should be
stored in a shared service

Data on an instance should
be replicated to other
instances

“Build for Three”
We hold a boot camp for new engineers to teach them how
to build for a highly distributed environment.


Netflix on AWS
[Diagram labels: 2012, IPv6]

Open Connect


Highly aligned, loosely coupled

• Services are built by different teams
who work together to figure out what
each service will provide.

• The service owner publishes an API
that anyone can use.


Advantages to a Service
Oriented Architecture
• Easier auto-scaling

• Easier capacity planning

• Identify problematic code-paths more easily

• Narrow down the effects of a change

• More efficient local caching


Freedom and Responsibility

• Developers deploy when they want

• They also manage their own capacity
and autoscaling

• And fix anything that breaks at 4am!


All system choices assume
some part will fail at some
point.


The Monkey Theory

• Simulate things
that go wrong
• Find things that
are different


Execution

Photo from I, Robot, copyright 20th Century Fox

Netflix built a global PaaS

• Service Oriented
Architecture
• HTTP/REST interfaces
between services


Netflix PaaS features
• Supports all regions and zones

• Multiple accounts

• Cross region/account replication

• Internationalized, localized and GeoIP routed

• Advanced key management

• Autoscaling with 1000s of instances

• Monitoring and alerting on millions of metrics

What AWS Provides
• Instances

• Machine Images

• Elastic IPs

• Load Balancers

• Security groups / Autoscaling groups

• Availability zones and regions


Linux Base AMI (CentOS or Ubuntu)

• Optional Apache
• Java (JDK 6 or 7)
• AppDynamics App Agent monitoring
• Tomcat
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
• GC and thread dump logging
• Log rotation to S3
• AppDynamics Machine Agent monitoring


The Netflix Platform
• Discovery (Eureka)
• Entrypoints (Edda)
• Circuit Breakers (Hystrix)
• Configuration (Archaius)
• Cassandra (Priam & Astyanax & CassJMeter)
• Zookeeper (Exhibitor)
• Logging (Blitz4j & Honu)
• Cryptex
• AKMS
• EVCache
• NIWS
• Proxies
• Geo
• i18n
• L10n
• Base
[Diagram legend: Open Source]

Open Source at Netflix
[Timeline figure: open source releases by month, Nov through Oct, with 2012 marked. Projects shown: Curator, Astyanax, Servo, Priam, CassJMeter, Exhibitor, Archaius, Asgard, Chaos Monkey, Edda, Blitz4j, Hystrix, Governator, Eureka]

Finding things
• Discovery (Eureka)
    • Application to instance mapping
    • Heartbeat to keep track of health
• Entrypoints (Edda)
    • Local database of AWS resources
• NIWS (Netflix Internal Web Service)
    • On-instance software load balancer
    • Handles retry logic
• Geo (Geolocation library)
    • Provides IP to Lat/Lon mapping for any service that needs it
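
The "application to instance mapping" and "heartbeat" ideas above can be sketched as a toy in-memory registry. This is only an illustration of the concept, not Eureka's API: the class name, method names and the TTL value are all assumptions made for this example.

```python
import time
from collections import defaultdict


class Registry:
    """Toy discovery registry: maps application names to live instances."""

    def __init__(self, ttl_seconds=90.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # hypothetical expiry window
        self.clock = clock              # injectable so tests can fake time
        self._last_seen = defaultdict(dict)  # app -> {instance_id: timestamp}

    def heartbeat(self, app, instance_id):
        """Register an instance, or renew it if it is already known."""
        self._last_seen[app][instance_id] = self.clock()

    def instances(self, app):
        """Return instance IDs whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            iid for iid, seen in self._last_seen[app].items()
            if now - seen <= self.ttl_seconds
        )
```

An instance that stops sending heartbeats simply ages out of the mapping, so callers never need an explicit "deregister" step for a crashed machine.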


Entrypoints (Edda)

• REST API
• GET /REST/v2/instance/$id

• Keeps track of all resources

• Autoscaling groups, EIPs, Instances,
Applications, Clusters, History


Entrypoints Exploration
• Find all active instances:
    GET /REST/v2/view/instances
• Find all instances in a cluster:
    GET /REST/v2/group/clusters
• Show only ASG name, instance ID and health:
    /v2/aws/autoScalingGroups/edda-v123;_pp:(autoScalingGroupName,instances:(instanceId,lifecycleState))
• Which ASG contains a particular instance?
    /v2/aws/autoScalingGroups;instances.instanceId=i-96f3ca3a
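
The last two queries append filters and a field selection to the resource path. A minimal sketch of building such paths follows. It mirrors only the syntax visible on the slide; the function name and its parameters are hypothetical, and no host or /REST prefix is assumed.

```python
def edda_path(collection, resource_id=None, filters=None, fields=None):
    """Build an Edda-style path using the matrix arguments shown on the slide."""
    path = "/v2/" + collection
    if resource_id is not None:
        path += "/" + resource_id
    # Filters are appended as ;key=value pairs.
    for key, value in (filters or {}).items():
        path += ";{}={}".format(key, value)
    # Field selection is appended as ;_pp:(field,field,...).
    if fields:
        path += ";_pp:(" + ",".join(fields) + ")"
    return path


# Usage: rebuild the two slide examples.
by_instance = edda_path(
    "aws/autoScalingGroups",
    filters={"instances.instanceId": "i-96f3ca3a"},
)
asg_summary = edda_path(
    "aws/autoScalingGroups",
    resource_id="edda-v123",
    fields=["autoScalingGroupName", "instances:(instanceId,lifecycleState)"],
)
```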


Keeping it all Straight
• Configuration (Archaius)
    • Global variables (Fast properties)
• Base
    • Base system. Prod vs. Test, etc.
• Zookeeper (Curator)
    • Locks, other similar coordination
• Logging (Blitz4j and Honu)
    • Keep track of what happened and store it for post analysis

Keeping it Secure
• Cryptex
    • Service for key management
    • High, medium and low value keys
• AKMS (Amazon Key Management System)
    • Hands out keys to instances (and dev boxes) so they don’t have to store the key on the instance

For more info, see SEC201: Security Panel

Storing it
• Cassandra (Priam, Astyanax)
    • Configure and access Cassandra
    • Provide OO abstractions; handle connection pooling and discovery of hosts
• EVCache (Ephemeral Volatile Cache)
    • Wrapper for memcached to handle zone awareness and replication
• Proxies
    • Get data out of the datacenter and into the cloud
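
To make "zone awareness and replication" concrete, here is a rough sketch of one way such a wrapper could behave: write to every zone, read from the local zone first. It is not EVCache's implementation. Plain dicts stand in for memcached pools, and the class and zone names are assumptions for illustration.

```python
class ZoneAwareCache:
    """Toy zone-aware cache: write to every zone, read from the local zone first."""

    def __init__(self, local_zone, zone_stores):
        # zone_stores maps a zone name to a dict standing in for a memcached pool.
        self.local_zone = local_zone
        self.zone_stores = zone_stores

    def set(self, key, value):
        # Replicate every write to all zones so any zone can serve reads.
        for store in self.zone_stores.values():
            store[key] = value

    def get(self, key):
        # Prefer the local zone, then fall back to the remaining zones.
        zones = [self.local_zone] + [
            z for z in self.zone_stores if z != self.local_zone
        ]
        for zone in zones:
            value = self.zone_stores[zone].get(key)
            if value is not None:
                return value
        return None
```

Reading locally keeps latency low in the common case, while the fallback lets a zone with a cold or lost cache still answer from a neighbor.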

Data
What do we do with it all?


We store it!

• Cache (memcached)
• Cassandra
• RDS (MySQL)

Cassandra


Why Cassandra?

• Availability over consistency
• Writes over reads
• We know Java
• Open source + support

Using Cassandra at Netflix
• Priam
    • Zero touch auto-config
    • State management
    • Token assignment
    • Node replacement
    • Backup/restore to/from S3
• Astyanax
    • OO abstraction to Cassandra
    • Multi-region support
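
As background for "token assignment": with Cassandra's RandomPartitioner, nodes are commonly given evenly spaced tokens over a ring of size 2^127. The sketch below shows that textbook calculation only. How Priam itself assigns tokens is not described on the slide, so the function here is a hypothetical illustration.

```python
RING_SIZE = 2 ** 127  # token range of Cassandra's RandomPartitioner


def initial_tokens(node_count):
    """Return evenly spaced tokens for node_count nodes on the ring."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return [i * RING_SIZE // node_count for i in range(node_count)]
```

Automating this calculation is what removes the manual step of choosing a token each time a node is added or replaced.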


Cassandra Architecture



For more info, see DAT202: Optimizing your Cassandra Database on AWS

Tools
• Asgard

• AWS usage

• Atlas

• Chronos

• Build system

• Explorers (Cassandra and SimpleDB)


[Diagram of AWS objects: Elastic Load Balancer, Auto Scaling Group, Instances, Security Group, Launch Configuration, Amazon Machine Image]

api-frontend

api-usprod-v007 api-usprod-v008


Netflix has moved the
granularity from the
instance to the cluster
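
The ASG names on the previous slide (api-usprod-v007, api-usprod-v008) suggest a cluster name followed by a version sequence. A small sketch of treating the cluster, rather than the instance, as the unit of deployment follows. The naming pattern is inferred from those two names only, and both helper functions are hypothetical.

```python
import re

# Assumed naming pattern, inferred from the slide: <cluster>-v<3-digit sequence>.
ASG_NAME = re.compile(r"^(?P<cluster>.+)-v(?P<seq>\d{3})$")


def parse_asg_name(name):
    """Split an ASG name into (cluster, sequence number)."""
    match = ASG_NAME.match(name)
    if match is None:
        raise ValueError("unexpected ASG name: " + name)
    return match.group("cluster"), int(match.group("seq"))


def next_asg_name(existing):
    """Name the next ASG in a cluster, given the cluster's existing ASG names."""
    parsed = [parse_asg_name(name) for name in existing]
    clusters = {cluster for cluster, _ in parsed}
    if len(clusters) != 1:
        raise ValueError("expected ASGs from exactly one cluster")
    highest = max(seq for _, seq in parsed)
    return "{}-v{:03d}".format(clusters.pop(), highest + 1)
```

A new release then becomes a new ASG alongside the old one, and rolling back means shifting traffic to the previous sequence number instead of patching instances.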


Why Bake?
Traditional: Generic AMI -> Instance
• launch OS
• install packages
• install app

Netflix: App AMI -> Instance
• launch OS+app


Getting Baked
• Source: Perforce / Git
• Jenkins: sync, resolve, compile, test, build, publish, report
• Ant targets, Groovy all over
• Ivy: libraries
• Artifactory: libraries / apps (snapshot / release), app bundles


Base Image Baking
• Foundation AMI (Linux: CentOS, Fedora, Ubuntu), stored in S3 / EBS
• Bakery (EC2 slave instances on AWS): mount snapshot, install RPMs (Apache, Java...) via Yum / Apt
• Result: base AMI, ready for app bake

App Image Baking
• Base AMI (Linux, Apache, Java, Tomcat), stored in S3 / EBS
• Bakery (EC2 slave instances on AWS): mount snapshot, install app bundle from Jenkins / Yum / Artifactory
• Result: app AMI, ready to launch!


• Optional Apache
• Java (JDK 6 or 7)
• AppDynamics App Agent monitoring
• JBoss
• Application war file, base servlet, platform, interface jars for dependent services
• Healthcheck, status servlets, JMX interface, Servo autoscale
• GC and thread dump logging
• Log rotation to S3
• AppDynamics Machine Agent monitoring



• Optional Apache
• Python
• Monitoring
• Django
• Application file, base server, platform, interface libs for dependent services
• Logging
• Log rotation to S3
• AppDynamics Machine Agent monitoring


The Simian Army
• Chaos -- Kills random instances

• Chaos Gorilla -- Kills zones

• Chaos Kong -- Kills regions

• Latency -- Degrades network and injects faults

• Conformity -- Looks for outliers

• Circus -- Kills and launches instances to maintain zone balance

• Doctor -- Fixes unhealthy resources

• Janitor -- Cleans up unused resources

• Howler -- Yells about bad things like Amazon limit violations

• Security -- Finds security issues and expiring certificates
For more info, see ARC301: Intro to Chaos Monkey & the Simian Army
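
The first monkey, "kills random instances", can be illustrated with a few lines of selection logic. This is a sketch of the idea and not the Simian Army's code: the function name, the per-group probability and the data shape are assumptions, and actually terminating the chosen instance is left to the caller.

```python
import random


def pick_victims(groups, probability, rng=None):
    """Pick at most one random instance per group to terminate.

    groups maps a group name to a list of instance IDs.
    probability is the chance (0.0 to 1.0) that a given group loses an instance.
    """
    rng = rng or random.Random()
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims
```

Limiting the damage to one instance per group keeps the exercise survivable for any service that follows the "build for three" rule.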

What’s going on?!


Atlas


{
  "clusters": [
    "epic_aggregator",
    "epic_aggregator-dev"
  ],
  "alerts": [
    // you can use javascript style comments in the config
    {
      "metricName": "EpicPlugin_NumDropped",
      "applyTo": "cluster",
      "condition": {
        "type": "StaticThreshold",
        "max": 0.0
      },
      "severity": "major",
      "description": "plugin is dropping metrics"
    },
    {
      "metricName": "EpicPlugin_NumDropped_Instance",
      "applyTo": "instance",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "overrides": {
        "service_key_override": "12345",
        "require_instance_status_not_in": ["DOWN", "OUT_OF_SERVICE"],
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    },
    {
      "metricName": "EpicPlugin_MetricCount",
      "applyTo": "instance",
      "description": "${instanceId} is reporting too many metrics",
      "condition": {
        "type": "NumOccurrences",
        "num": 4,
        "condition": {
          "type": "StaticThreshold",
          "max": 0.0
        }
      },
      "additionalDetails": {
        "statusUrl": "http://${publicDnsName}:7001/Status",
        "nacClusterUrl": "nac${env}/${region}/cluster/show/${cluster}"
      },
      "overrides": {
        "subject": "${instanceId} is reporting too many metrics",
        "incident_key": "${metricName}:${instanceId}",
        "service_key_override": "12345",
        "email_override": "devnull@netflix.com"
      },
      "severity": "minor"
    }
  ]
}


Example Alert Config
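
The config nests a StaticThreshold condition inside a NumOccurrences condition. The slides do not define how these are evaluated, so the sketch below is a guess at plausible semantics: StaticThreshold fires when the latest sample exceeds "max", and NumOccurrences fires when the inner condition fired at "num" or more of the sample points. The function name is hypothetical.

```python
def breaches(condition, samples):
    """Return True if the condition fires for a list of recent metric samples."""
    kind = condition["type"]
    if kind == "StaticThreshold":
        # Assumed meaning: the most recent sample exceeds "max".
        return bool(samples) and samples[-1] > condition["max"]
    if kind == "NumOccurrences":
        # Assumed meaning: the nested condition fires at "num" or more sample points.
        inner = condition["condition"]
        hits = sum(
            1 for i in range(len(samples)) if breaches(inner, samples[: i + 1])
        )
        return hits >= condition["num"]
    raise ValueError("unknown condition type: " + kind)
```

Under these assumed semantics, wrapping a threshold in NumOccurrences is a simple way to avoid paging on a single noisy sample.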


Alert Tuning


Alert Systems
[Diagram components: Atlas, AppDynamics, CORE Agent (api), Other Team’s Agent, CORE Event Gateway, Paging Service (alerting), Amazon SES]


Chronos


Data Collection Pipeline

Data Processing Pipeline

For more info, see BDT303: Data Science with Elastic MapReduce

Chukwa/Honu messages / min

63 billion messages a day


Best Practices


Incident Reviews
Ask the key questions:

• What went wrong?

• How could we have detected it sooner?

• How could we have prevented it?

• How can we prevent this class of
problem in the future?

• How can we improve our behavior for
next time?

Best Practices for Data
• Have multiple copies of all data
• Keep those copies in multiple AZs
• Avoid keeping state on a single instance
• Take frequent snapshots of EBS disks
• No secret keys on the instance


Netflix autoscaling
[Graph annotations: 1. Traffic Peak, 2. Deployment]


AWS Usage
Dollar amounts have been carefully removed


Going multi-zone


Benefits of Amazon’s Zones

• Loosely connected

• Low latency between zones

• 99.95% uptime guarantee per region


Going Multi-region


Leveraging Multi-region

• 100% uptime is theoretically possible.

• You have to replicate your data

• This will cost money
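
For a sense of scale, the 99.95% per-region figure quoted earlier can be turned into hours of downtime, and combined across regions. The combination below assumes regions fail independently, which is a strong simplifying assumption and not a claim from the talk. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability, regions):
    """Availability of N regions, assuming they fail independently."""
    return 1.0 - (1.0 - availability) ** regions
```

Under this model 99.95% allows about 4.38 hours of downtime a year for one region, and each added region moves the combined figure closer to 100%.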


Circuit Breakers (Hystrix)
Be liberal in what you accept, strict in what you send
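
The circuit breaker pattern itself can be sketched in a few lines. This is a generic illustration and not Hystrix's API (Hystrix is a Java library): the class, its thresholds and the fallback parameter are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock       # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                # Open: fail fast instead of calling the struggling dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps a slow dependency from tying up the caller's threads, and the fallback lets the caller serve a degraded answer instead of an error.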


Just a quick reminder...

• (Some of) Netflix is open source:

• https://github.com/netflix


We are sincerely eager to
hear your feedback on this
presentation and on re:Invent.

Please fill out an evaluation
form when you have a
chance.

Questions?


Getting in touch
Email: jedberg@{gmail,netflix}.com
Twitter: @jedberg
Web: www.jedberg.net
Facebook: facebook.com/jedberg
Linkedin: www.linkedin.com/in/jedberg
