How Netflix Leverages Multiple Regions to Increase
Availability: Isthmus and Active-Active Case Study
Ruslan Meshenberg
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Failure
Assumptions
Everything Is Broken

[Quadrant chart: Speed (rate of change) vs. Scale.
Slowly changing, large scale (telcos) – hardware will fail.
Slowly changing, small scale (enterprise IT) – everything works.
Rapid change, small scale (startups) – software will fail.
Rapid change, large scale (web scale) – everything is broken.]
Incidents – Impact and Mitigation
[Incident pyramid, most to least severe:
PR – public relations / media impact: X incidents; Y incidents mitigated by active-active and game day practicing.
CS – high customer service calls: XX incidents; YY incidents mitigated by better tools and practices.
Metrics impact (affects A/B test results) – feature disable: XXX incidents; YYY incidents mitigated by better data tagging.
No impact – fast retry or automated failover: XXXX incidents.]
Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
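As a rough illustration of the Chaos Monkey idea only (not Netflix's Simian Army implementation), the sketch below terminates a random instance in an Auto Scaling group and relies on the group to replace it. The group name and region are hypothetical.

```python
# Minimal sketch of the Chaos Monkey idea: kill a random instance in an Auto
# Scaling group and rely on the group to replace it. Not Netflix's implementation;
# the group name and region below are hypothetical.
import random
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

def terminate_random_instance(group_name: str) -> str:
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name])["AutoScalingGroups"][0]
    if not group["Instances"]:
        raise RuntimeError("no instances to terminate")
    victim = random.choice(group["Instances"])["InstanceId"]
    ec2.terminate_instances(InstanceIds=[victim])   # the ASG notices the loss and replaces it
    return victim

if __name__ == "__main__":
    print("terminated", terminate_random_instance("api-prod-v042"))  # hypothetical ASG name
```

Chaos Gorilla and Chaos Kong apply the same idea at zone and region scope.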
Does a Zone Fail?
• Rarely, but happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
Does a Region Fail?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely, a region-wide configuration issue

• Test with Chaos Kong
Everything Fails… Eventually
• The good news is you can do something about it
• Keep your services running by embracing
isolation and redundancy
Cloud Native
A New Engineering Challenge
Construct a highly agile and highly
available service from ephemeral and
assumed broken components
Isolation
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not
affect functionality / operations
Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across
Availability Zones and regions
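A minimal sketch of what "more than one of everything" looks like at the Auto Scaling level: one group spread across three Availability Zones, with the same pattern repeated per region. The launch template and group name are hypothetical.

```python
# Sketch: spread one service's capacity across three Availability Zones.
# Launch template and group name are hypothetical; repeat per region for
# cross-region redundancy.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

asg.create_auto_scaling_group(
    AutoScalingGroupName="api-uswest2",                   # hypothetical
    LaunchTemplate={"LaunchTemplateName": "api-launch-template", "Version": "$Latest"},
    MinSize=6,
    MaxSize=30,
    DesiredCapacity=6,                                    # roughly two instances per zone to start
    AvailabilityZones=["us-west-2a", "us-west-2b", "us-west-2c"],
)
```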
History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process that
was inadvertently run against the production ELB
state data”
• “The ELB issue affected less than 7% of running
ELBs and prevented other ELBs from scaling.”
Isthmus – Normal Operation

[Diagram: DNS splits traffic between the US-East ELB and the US-West-2 ELB; requests arriving in US-West-2 pass through the Zuul layer and a tunnel to the US-East infrastructure*; Zones A, B, and C hold Cassandra replicas.]
Isthmus – Failover

[Diagram: the same topology with the US-East ELB path failed; all traffic enters through the US-West-2 ELB and is tunneled by Zuul to the US-East infrastructure*.]
Isthmus
Zuul – Overview

[Diagram: Zuul overview – the Zuul routing layer sits between Elastic Load Balancing tiers and the back-end services.]
Zuul – Details
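Zuul is NetflixOSS's Java edge router; its filters implement the "isthmus" forwarding shown above. As a stand-in illustration of the tunnel idea only (this is not Zuul or its API), here is a tiny proxy that accepts requests in one region and forwards them to a backend endpoint in another; the upstream URL is hypothetical.

```python
# Illustration of the isthmus/tunnel idea only -- this is not Zuul. A thin proxy
# tier in one region forwards requests to backend infrastructure in another
# region. The upstream endpoint is hypothetical.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://api.us-east-1.example.internal"   # hypothetical backend in the other region

class IsthmusProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request across the "tunnel" to the remote region.
        headers = {k: v for k, v in self.headers.items() if k.lower() != "host"}
        try:
            with urlopen(Request(UPSTREAM + self.path, headers=headers), timeout=5) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        except OSError:
            self.send_error(502, "upstream region unreachable")

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), IsthmusProxy).serve_forever()
```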
Denominator – Abstracting the DNS Layer

[Diagram: Denominator provides a single API over Amazon Route 53, DynECT DNS, and UltraDNS; DNS records point at regional load balancers (ELBs) in each region, which front Zones A, B, and C holding Cassandra replicas.]
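Denominator itself is a Java library that abstracts these providers behind one API. As a concrete example of the kind of change it issues during a failover, the sketch below repoints a CNAME at a different regional load balancer using Route 53 directly; the hosted zone ID, record name, and ELB name are hypothetical.

```python
# Sketch of the kind of DNS change a Denominator-driven failover issues, using
# Route 53 directly as one provider. Zone ID, record, and ELB names are hypothetical.
import boto3

route53 = boto3.client("route53")

def point_at_region(record_name: str, elb_dns_name: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",                    # hypothetical hosted zone
        ChangeBatch={
            "Comment": "shift traffic to the surviving region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": elb_dns_name}],
                },
            }],
        },
    )

point_at_region("api.example.com.",
                "api-uswest2-1234567890.us-west-2.elb.amazonaws.com")
```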
Isthmus – Only for Elastic Load Balancing Failures

• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
Active-Active – Full Regional Resiliency

[Diagram: regional load balancers in both regions take live traffic; each region has Zones A, B, and C with Cassandra replicas.]
Active-Active – Failover

[Diagram: one region's load balancers are failed; all traffic is directed to the surviving region's load balancers and zones.]
Active-Active Architecture
Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
Highly Available NoSQL Storage

A highly scalable, available, and
durable deployment pattern based on
Apache Cassandra
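A minimal sketch of this deployment pattern using the DataStax CQL driver (Netflix used the Astyanax client at the time): one keyspace replicated three ways in each of two data centers, written with a local consistency level. The seed node, data-center names, and schema are hypothetical.

```python
# Sketch of the multi-region Cassandra pattern with the CQL driver (not the
# Astyanax client Netflix used in 2013). Seed node, DC names, and schema are hypothetical.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.10"]).connect()

# Three replicas per region; remote replicas converge asynchronously.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS subscriber
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'us-east': 3, 'us-west-2': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS subscriber.profile (id text PRIMARY KEY, plan text)
""")

# Ack from one replica in the local data center; cross-region replication is eventual.
insert = SimpleStatement(
    "INSERT INTO subscriber.profile (id, plan) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_ONE)
session.execute(insert, ("user-123", "4-screen"))
```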
Benchmarking Global Cassandra
Write intensive test of cross-region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total
192 TB of SSD in six locations up and running Cassandra in 20 minutes
[Diagram: a test load drives 1 million writes at CL.ONE (wait for one replica to ack) into the US-East-1 (Virginia) cluster; a validation load in US-West-2 (Oregon) performs 1 million reads at CL.ONE after 500 ms with no data loss. Each region spans Zones A, B, and C with Cassandra replicas; interregion traffic reached up to 9 Gbit/s at 83 ms, with 18 TB of backup data loaded from S3.]
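The shape of that benchmark can be sketched roughly as below: write at CL.ONE in US-East-1, wait about 500 ms, then read the same keys at CL.ONE from US-West-2 and count anything missing. Seeds, keyspace, and table are hypothetical, and the real harness drove a million operations from many clients.

```python
# Sketch of the benchmark's shape only: write at CL.ONE in us-east-1, read the
# same keys at CL.ONE from us-west-2 ~500 ms later, and count losses.
# Seed nodes, keyspace, and table are hypothetical.
import time
import uuid
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

east = Cluster(["10.10.0.10"]).connect("bench")   # hypothetical us-east-1 seed
west = Cluster(["10.20.0.10"]).connect("bench")   # hypothetical us-west-2 seed

write = SimpleStatement("INSERT INTO kv (k, v) VALUES (%s, %s)",
                        consistency_level=ConsistencyLevel.ONE)
read = SimpleStatement("SELECT v FROM kv WHERE k = %s",
                       consistency_level=ConsistencyLevel.ONE)

keys = [str(uuid.uuid4()) for _ in range(1000)]
for k in keys:
    east.execute(write, (k, "payload"))

time.sleep(0.5)                                   # allow cross-region replication
missing = [k for k in keys if west.execute(read, (k,)).one() is None]
print(f"{len(missing)} of {len(keys)} keys missing after 500 ms")
```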
Propagating EVCache Invalidations
[Diagram: propagating EVCache invalidations from US-EAST-1 to US-WEST-2.
1. The app server sets data in EVCache (US-East-1).
2. The EVCache client writes events to SQS and EVCACHE_REGION_REPLICATION.
3. The replication service's drainer reads from SQS in batches.
4. The drainer calls the writer with key, write time, TTL, and value after checking that this is the latest event for the key in the current batch; the call goes cross-region through ELB over HTTPS.
5. The writer (US-West-2) checks the write time against the replication metadata to ensure this is the latest operation for the key.
6. The writer deletes the value for the key in the local EVCache.
7. The writer returns the keys that were successful.
8. The drainer deletes the successful keys from SQS.]
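A sketch of the drainer side of steps 3 through 8, assuming a hypothetical queue URL, message format, and writer endpoint; the actual EVCache replication service is a NetflixOSS Java component.

```python
# Sketch of the replication drainer loop (steps 3-8 above). Queue URL, message
# fields, and writer endpoint are hypothetical; not the real EVCache service.
import json
import urllib.request
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/evcache-replication"  # hypothetical
WRITER_URL = "https://evcache-writer.us-west-2.example.com/replicate"               # hypothetical

while True:
    batch = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                WaitTimeSeconds=10).get("Messages", [])
    if not batch:
        continue

    # Keep only the newest event per key within the batch (step 4).
    latest = {}
    for msg in batch:
        event = json.loads(msg["Body"])
        key = event["key"]
        if key not in latest or event["writeTime"] > latest[key][0]["writeTime"]:
            latest[key] = (event, msg)

    done = []
    for event, msg in latest.values():
        req = urllib.request.Request(WRITER_URL, data=json.dumps(event).encode(),
                                     headers={"Content-Type": "application/json"})
        try:
            urllib.request.urlopen(req, timeout=5)   # remote writer checks write time, then invalidates
            done.append({"Id": msg["MessageId"], "ReceiptHandle": msg["ReceiptHandle"]})
        except OSError:
            pass                                      # leave the message on the queue to retry

    if done:
        sqs.delete_message_batch(QueueUrl=QUEUE_URL, Entries=done)   # step 8
```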
Archaius – Region-isolated Configuration
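Archaius is NetflixOSS's Java configuration library; the point of this slide is that properties are resolved per region so a change in one region cannot leak into another. As a rough, non-Archaius illustration of that cascading idea (file names and the property key are hypothetical):

```python
# Rough illustration of region-isolated configuration (not the Archaius API):
# base properties are overlaid by a region-specific file, so edits to one region's
# file never affect another region. File names and keys are hypothetical.
import json
import os

def load_region_properties(app: str, region: str) -> dict:
    props = {}
    for path in (f"{app}.json", f"{app}-{region}.json"):   # region overlay wins
        if os.path.exists(path):
            with open(path) as fh:
                props.update(json.load(fh))
    return props

region = os.environ.get("EC2_REGION", "us-east-1")
props = load_region_properties("api", region)
failover_enabled = props.get("traffic.failover.enabled", False)
```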
Running Isthmus and Active-Active
Multiregional Monkeys
• Detect failure to deploy
• Differences in configuration
• Resource differences
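One flavor of multiregional monkey is a conformity check: compare the same resources across regions and flag drift. A naive sketch follows; pagination and name matching are deliberately simplified, and the regions are just examples.

```python
# Naive sketch of a cross-region conformity check: flag Auto Scaling groups whose
# sizing differs between regions or that exist in only one region.
import boto3

def asg_sizes(region: str) -> dict:
    client = boto3.client("autoscaling", region_name=region)
    groups = client.describe_auto_scaling_groups()["AutoScalingGroups"]  # ignores pagination
    return {g["AutoScalingGroupName"]: (g["MinSize"], g["MaxSize"]) for g in groups}

east, west = asg_sizes("us-east-1"), asg_sizes("us-west-2")
for name in sorted(east.keys() & west.keys()):
    if east[name] != west[name]:
        print(f"resource difference: {name} east={east[name]} west={west[name]}")
for name in sorted(east.keys() ^ west.keys()):
    print(f"deployed in only one region: {name}")
```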
Multiregional Monitoring and Alerting
• Per region metrics
• Global aggregation
• Anomaly detection
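A toy sketch of the last two bullets: sum per-region metrics into a global view and run a crude per-region deviation check. The sample numbers are made up; Netflix's real pipeline is its own telemetry and alerting stack.

```python
# Toy sketch: aggregate per-region metrics globally and run a crude per-region
# anomaly check. Sample values are made up.
from statistics import mean, stdev

per_region_rps = {
    "us-east-1": [4200, 4180, 4250, 900],    # last sample drops sharply
    "us-west-2": [2100, 2090, 2120, 2110],
}

global_rps = [sum(samples) for samples in zip(*per_region_rps.values())]
print("global requests/sec:", global_rps)

for region, samples in per_region_rps.items():
    baseline, latest = samples[:-1], samples[-1]
    if abs(latest - mean(baseline)) > 3 * max(stdev(baseline), 1.0):
        print(f"anomaly in {region}: latest={latest}, baseline mean={mean(baseline):.0f}")
```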
Failover and Fallback
• DNS (denominator) changes
• For fallback, ensure data consistency
• Some challenges
– Cold cache
– Autoscaling

• Automate, automate, automate
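One example of the automation called for above, aimed at the cold-cache and autoscaling challenges: scale up the surviving region before shifting DNS toward it. The scaling factor is a hypothetical placeholder, and the real failover tooling does much more (including priming caches).

```python
# Sketch of one automated failover step: pre-scale the surviving region's Auto
# Scaling groups before repointing DNS at it, so traffic does not land on cold,
# undersized capacity. The scaling factor is a hypothetical placeholder.
import boto3

def prescale_region(region: str, factor: float = 2.0) -> None:
    asg = boto3.client("autoscaling", region_name=region)
    for group in asg.describe_auto_scaling_groups()["AutoScalingGroups"]:
        target = min(int(group["DesiredCapacity"] * factor), group["MaxSize"])
        asg.set_desired_capacity(AutoScalingGroupName=group["AutoScalingGroupName"],
                                 DesiredCapacity=target, HonorCooldown=False)

prescale_region("us-west-2")
# ...then issue the DNS change (e.g., via Denominator or Route 53) once capacity is up.
```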
Validating the Whole Thing Works
Dev-Ops in N Regions
• Best practices: avoiding peak times for
deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery
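A toy version of the canary gate behind "automated canaries / continuous delivery": compare canary and baseline error rates and stop the rollout if the canary regresses. Thresholds and numbers are hypothetical stand-ins for a real canary-analysis system.

```python
# Toy canary gate: fail the deployment if the canary's error rate is markedly
# worse than the baseline's. Thresholds and sample numbers are hypothetical.
def canary_passes(baseline_errors: int, baseline_requests: int,
                  canary_errors: int, canary_requests: int,
                  max_ratio: float = 1.5) -> bool:
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    return canary_rate <= max_ratio * max(baseline_rate, 1e-4)

if not canary_passes(baseline_errors=120, baseline_requests=100_000,
                     canary_errors=9, canary_requests=5_000):
    raise SystemExit("canary failed -- rolling back")
```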
Hyperscale Architecture

[Diagram: DNS automation drives Amazon Route 53, DynECT DNS, and UltraDNS; DNS directs traffic to regional load balancers in two regions, each fronting Zones A, B, and C with Cassandra replicas.]
Does It Work?
Building Blocks Available on the Netflix GitHub Site
Topic | Session # | When
What an Enterprise Can Learn from Netflix, a Cloud-native Company | ENT203 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Maximizing Audience Engagement in Media Delivery | MED303 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Scaling your Analytics with Amazon Elastic MapReduce | BDT301 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Automated Media Workflows in the Cloud | MED304 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale | BDT302 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Encryption and Key Management in AWS | SEC304 | Friday, Nov 15, 9:00 AM - 10:00 AM
Your Linux AMI: Optimization and Performance | CPN302 | Friday, Nov 15, 11:30 AM - 12:30 PM
Takeaways
Embrace isolation and redundancy for availability
NetflixOSS helps everyone to become cloud native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
@rusmeshenberg @NetflixOSS
We are sincerely eager to hear
your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form
when you have a chance.


Editor's Notes

  1. We've built cross-vendor DNS automation (NetflixOSS Denominator). Global deployment in minutes: a robust, agile, denormalized NoSQL store, triple-replicated across availability zones with remote replication across regions. It can survive the failure of any one zone with no impact. If a whole region is lost, half the customers are redirected to the working region; if a DNS vendor fails, switch the configuration to a different vendor. Proven scale to well over 10,000 instances, with code pushes and autoscaling varying the fleet by thousands of instances a day.