Learn how to make your services more resilient and available by embracing the principles of isolation and redundancy. This deck walks through two projects, Isthmus and Active-Active, to show how Netflix architects for availability in a multi-regional environment.
ARC305: How Netflix Leverages Multiple Regions to Increase Availability – An Isthmus and Active-Active Case Study
3. Assumptions
Everything Is Broken
[Quadrant chart: Speed (slowly changing → rapid change) vs. Scale (small → large)]
• Telcos: large scale, slowly changing
• Enterprise IT: small scale, slowly changing – the world where "everything works"
• Startups: small scale, rapid change
• Web Scale: large scale, rapid change – assume "everything is broken"
• Hardware will fail; software will fail
4. Incidents – Impact and Mitigation
• PR (X incidents): public relations / media impact; Y incidents mitigated by Active-Active and game day practicing
• CS (XX incidents): high customer service calls; YY incidents mitigated by better tools and practices
• Metrics Impact – Feature Disable (XXX incidents): affects A/B test results; YYY incidents mitigated by better data tagging
• No Impact – Fast Retry or Automated Failover (XXXX incidents)
5. Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
6. Does a Zone Fail?
• Rarely, but it has happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
7. Does a Region Fail?
• Full region – unlikely, very rare
• Individual Services can fail region-wide
• Most likely, a region-wide configuration issue
• Test with Chaos Kong
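Slides 5-7 describe injecting failure at the instance, zone, and region level with the Chaos Monkey family of tools. As a minimal sketch of the instance-level idea (the function and data shapes here are hypothetical, not Chaos Monkey's actual API): pick one random instance per auto scaling group, skipping groups that have opted out of chaos testing.

```python
import random

def pick_victims(groups, opted_out=(), rng=random):
    """Pick one random instance per auto scaling group to terminate,
    skipping groups that opted out of chaos testing."""
    victims = {}
    for group, instances in groups.items():
        if group in opted_out or not instances:
            continue
        victims[group] = rng.choice(instances)
    return victims

# Hypothetical fleet: three groups, one opted out of chaos testing.
groups = {
    "api":   ["i-01", "i-02", "i-03"],
    "zuul":  ["i-10", "i-11"],
    "batch": ["i-20"],
}
victims = pick_victims(groups, opted_out={"batch"})
```

Running this regularly in production, rather than only in test environments, is what forces every service to actually survive the failures the slides list.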
8. Everything Fails… Eventually
• The good news is you can do something about it
• Keep your services running by embracing
isolation and redundancy
9. Cloud Native
A New Engineering Challenge
Construct a highly agile and highly
available service from ephemeral and
assumed broken components
10. Isolation
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not
affect functionality / operations
11. Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across
Availability Zones and regions
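The redundancy principle above can be sketched as a simple placement rule: spread replicas round-robin across zones so no zone holds more than its share. This is an illustrative toy, not Netflix's actual placement logic.

```python
from collections import defaultdict
from itertools import cycle

def place_replicas(n_replicas, zones):
    """Distribute n replicas evenly across availability zones, round-robin,
    so losing any one zone loses at most ceil(n/len(zones)) replicas."""
    placement = defaultdict(list)
    zone_cycle = cycle(zones)
    for replica in range(n_replicas):
        placement[next(zone_cycle)].append(replica)
    return dict(placement)

placement = place_replicas(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
```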
12. History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process that
was inadvertently run against the production ELB
state data”
• “The ELB issue affected less than 7% of running
ELBs and prevented other ELBs from scaling.”
13. Isthmus – Normal Operation
[Diagram: traffic arrives at the US-East ELB and the US-West-2 ELB; in US-East, Zuul forwards requests over a tunnel to the infrastructure in US-West-2, which spans Zones A, B, and C with Cassandra replicas in each zone.]
14. Isthmus – Failover
[Diagram: the same topology with traffic shifted away from the US-East ELB directly to the US-West-2 ELB; Zones A, B, and C with their Cassandra replicas are unchanged.]
18. Denominator – Abstracting the DNS Layer
[Diagram: Denominator abstracts Amazon Route 53, DynECT DNS, and UltraDNS, steering traffic to regional load balancers (ELBs) in two regions, each fronting Zones A, B, and C with Cassandra replicas in every zone.]
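Denominator itself is a Java library; the sketch below is a hypothetical Python rendering of the same idea, a single client API over pluggable DNS vendors, so a vendor outage is handled by switching providers rather than rewriting automation. Class and method names here are invented for illustration.

```python
class DnsProvider:
    """Stand-in for one DNS vendor backend (Route 53, DynECT, UltraDNS)."""
    def __init__(self, name):
        self.name = name
        self.records = {}

    def upsert(self, fqdn, target):
        # A real backend would call the vendor's API here.
        self.records[fqdn] = target

class DnsAbstraction:
    """Route record updates through whichever vendor is currently active."""
    def __init__(self, providers):
        self.providers = {p.name: p for p in providers}
        self.active = next(iter(self.providers))

    def switch(self, name):
        self.active = name

    def upsert(self, fqdn, target):
        self.providers[self.active].upsert(fqdn, target)

route53, dynect = DnsProvider("route53"), DnsProvider("dynect")
dns = DnsAbstraction([route53, dynect])
dns.upsert("www.example.com", "us-east-1-elb")
dns.switch("dynect")  # simulate failing over to a different DNS vendor
dns.upsert("www.example.com", "us-west-2-elb")
```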
19. Isthmus – Only for Elastic Load Balancing Failures
• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
20. Active-Active – Full Regional Resiliency
[Diagram: regional load balancers in two regions, each fronting Zones A, B, and C with Cassandra replicas in every zone.]
21. Active-Active – Failover
[Diagram: the same two-region topology, with all traffic directed to one region's load balancers while the other region is out of service.]
23. Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
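Eventual consistency across regions means replicas can briefly diverge and must reconcile deterministically. A minimal sketch of last-write-wins reconciliation, the rule Cassandra applies per column using write timestamps (the data shapes below are illustrative, not Cassandra's internal format):

```python
def merge(local, remote):
    """Last-write-wins merge of two replicas.
    Each replica maps key -> (value, write_timestamp);
    for conflicting keys the newer timestamp wins."""
    merged = dict(local)
    for key, (value, ts) in remote.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# Divergent replicas in two regions (hypothetical data):
east = {"profile:1": ("old-plan", 100)}
west = {"profile:1": ("new-plan", 250)}
merged = merge(east, west)
```

Because the rule is deterministic and commutative, both regions converge to the same state regardless of replication order, which is what separates eventual consistency from "hopeful" consistency.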
24. Highly Available NoSQL Storage
A highly scalable, available, and
durable deployment pattern based on
Apache Cassandra
25. Benchmarking Global Cassandra
Write-intensive test of cross-region replication capacity:
• 16 hi1.4xlarge SSD nodes per zone, 96 nodes total
• 192 TB of SSD in six locations, up and running Cassandra in 20 minutes
• 18 TB of backups loaded from S3
• Test load (US-East-1, Virginia): 1 million writes at CL.ONE (wait for one replica to ack)
• Validation load (US-West-2, Oregon): 1 million reads after 500 ms at CL.ONE with no data loss
• Inter-region traffic: up to 9 Gbit/s at 83 ms
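The benchmark's pass condition can be modeled as a toy: if every write replicates cross-region within the observed lag, then a CL.ONE read issued after the 500 ms wait sees every key. This is a deliberately simplified model (all writes at t=0, replication completes exactly after the lag), not the real test harness.

```python
def run_benchmark(n_keys, lag_ms, read_delay_ms):
    """Toy model of the slide's test: write n_keys in one region, then
    read them from the other region after read_delay_ms. A key is
    visible remotely once the replication lag (lag_ms) has elapsed.
    Returns the number of keys missing from the remote read."""
    east = {f"key-{i}": "v" for i in range(n_keys)}
    west = east if read_delay_ms >= lag_ms else {}
    missing = [k for k in east if k not in west]
    return len(missing)

# Slide's numbers: 1M keys, ~83 ms inter-region latency, reads after 500 ms.
losses = run_benchmark(1_000_000, lag_ms=83, read_delay_ms=500)
```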
26. Propagating EVCache Invalidations
US-East-1:
1. The app server's EVCache client sets data in EVCache.
2. Write events go to the EVCACHE_REGION_REPLICATION SQS queue and to the EVCache Replication Metadata store.
3. The EVCache Replication Service drainer reads from SQS in batches.
4. The drainer calls the writer with key, write time, TTL, and value, after checking that this is the latest event for the key in the current batch; the call goes cross-region through ELB over HTTPS.
US-West-2:
5. The writer checks the write time to ensure this is the latest operation for the key.
6. It deletes the value for the key in EVCache.
7. Keys that were successful are returned.
8. Successful keys are deleted from SQS.
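The two ordering checks in the flow above (step 4's per-batch dedupe and step 5's write-time guard) can be sketched as follows. Function names and event fields are hypothetical stand-ins, not EVCache's actual code.

```python
def latest_per_key(batch):
    """Step 4: within one SQS batch, keep only the newest event per key."""
    latest = {}
    for event in batch:
        key = event["key"]
        if key not in latest or event["write_time"] > latest[key]["write_time"]:
            latest[key] = event
    return latest

def apply_invalidation(event, metadata, cache):
    """Steps 5-6: skip the event if a newer operation for the key has
    already landed; otherwise record its write time and drop the key."""
    key, wt = event["key"], event["write_time"]
    if metadata.get(key, -1) > wt:
        return False  # stale event, a newer write already replicated
    metadata[key] = wt
    cache.pop(key, None)
    return True

# Hypothetical batch: two events for user:1, one for user:2.
batch = [
    {"key": "user:1", "write_time": 100},
    {"key": "user:1", "write_time": 200},
    {"key": "user:2", "write_time": 150},
]
latest = latest_per_key(batch)
metadata, cache = {}, {"user:1": "stale", "user:2": "stale"}
applied = [apply_invalidation(e, metadata, cache) for e in latest.values()]
```

Checking write times on both sides keeps the replication pipeline idempotent: redelivered or reordered SQS messages cannot resurrect an older invalidation.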
33. DevOps in N Regions
• Best practices: avoid peak times for deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery
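A minimal sketch of the automated-canary idea from the bullets above: compare the canary's error rate against the baseline fleet and fail the deploy if it exceeds a tolerance multiplier. The threshold rule and names are hypothetical, not Netflix's actual canary analysis.

```python
def canary_passes(baseline_errors, baseline_total,
                  canary_errors, canary_total, tolerance=1.5):
    """Pass the canary if its error rate is within `tolerance` times the
    baseline fleet's error rate (hypothetical threshold rule)."""
    base_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= base_rate * tolerance

# Baseline fleet: 10 errors in 10,000 requests (0.1% error rate).
ok = canary_passes(10, 10_000, 1, 1_000)    # canary at 0.1%: within bounds
bad = canary_passes(10, 10_000, 5, 1_000)   # canary at 0.5%: roll back
```

A real system would also compare latency percentiles and business metrics, and would roll back automatically on failure, which is what makes early problem detection safe across N regions.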
37. Related Sessions
Topic | Session # | When
What an Enterprise Can Learn from Netflix, a Cloud-native Company | ENT203 | Thursday, Nov 14, 4:15 PM – 5:15 PM
Maximizing Audience Engagement in Media Delivery | MED303 | Thursday, Nov 14, 4:15 PM – 5:15 PM
Scaling your Analytics with Amazon Elastic MapReduce | BDT301 | Thursday, Nov 14, 4:15 PM – 5:15 PM
Automated Media Workflows in the Cloud | MED304 | Thursday, Nov 14, 5:30 PM – 6:30 PM
Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale | BDT302 | Thursday, Nov 14, 5:30 PM – 6:30 PM
Encryption and Key Management in AWS | SEC304 | Friday, Nov 15, 9:00 AM – 10:00 AM
Your Linux AMI: Optimization and Performance | CPN302 | Friday, Nov 15, 11:30 AM – 12:30 PM
38. Takeaways
Embrace isolation and redundancy for availability
NetflixOSS helps everyone to become cloud native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
@rusmeshenberg @NetflixOSS
39. We are sincerely eager to hear
your feedback on this
presentation and on re:Invent.
Please fill out an evaluation form
when you have a chance.
Editor's Notes
We’ve built cross-vendor DNS automation: NetflixOSS Denominator. Global deployment in minutes; robust, agile, denormalized NoSQL. Triple-replicated across availability zones, with remote replication across regions. Able to survive the failure of any one zone with no impact; if a whole region is lost, half the customers are redirected to the working region. If a DNS vendor fails, switch the configuration to a different vendor. Proven at scale on well over 10,000 instances, with code pushes and autoscaling varying the fleet by thousands of instances a day.