Capacity Management: For Web Operations

John Allspaw, Operations Engineering
the book I'm writing
???
Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks
(with some Flickr statistics sprinkled in)
Things that can cause downtime

bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)
security incidents
real capacity problems*
* (should be the last thing you need to worry about)
Capacity != Performance
Forget about performance for right now
Measure what you have right NOW

Don't count on it getting any better
Thank You HPC Industry!

Automated Stuff
Scalable Metric Collection/Display
a lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement Tools

record and store metrics
in/out custom metrics
easily compare
lightweight-ish
Clouds need planning too

Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black boxes: you may need to pay even more attention than before
Metrics
System Statistics
Application Level
Metrics
(photos processed per minute)
(average processing time per photo)
(apache requests)
(concurrent busy apache procs)
Metrics
App-level meets system-level
here, total CPU = ~1.12 * # busy apache procs
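The ~1.12 figure is the slope of a line relating an application-level metric to a system-level one. A minimal sketch of how such a ratio could be derived from paired samples, assuming a straight line through the origin; the sample values and the function name are made up for illustration, and only the 1.12 ratio comes from the slide.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope k for the model y = k * x (no intercept)."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    if den == 0:
        raise ValueError("need at least one non-zero x sample")
    return num / den

# Hypothetical paired samples: busy apache procs and total CPU %.
busy_procs = [10, 25, 40, 60, 75]
cpu_percent = [11.2, 28.0, 44.8, 67.2, 84.0]

k = slope_through_origin(busy_procs, cpu_percent)
print(f"total CPU ~= {k:.2f} * busy apache procs")
```

With real production data the points will scatter around the line, so it is worth plotting them before trusting the slope.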
2400 photos per minute being uploaded right NOW (Tuesday
Ceilings
the most work your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings
what you have left
The End
Use real live production data to find ceilings
Production: it's like a lab, but bigger!
Like: database ceilings
replication lag: bad!
Ceilings
waiting on disk: sustained disk I/O wait above 40% creates too much slave lag*
*for us, YMMV
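One way to turn "sustained disk I/O wait above 40%" into a check is to look for a run of consecutive samples over the threshold. This is only a sketch: the 40% threshold is from the slide, while the sampling interval, the run length of 3 and the function name are assumptions.

```python
def sustained_above(samples, threshold, min_run):
    """True if `samples` contains at least `min_run` consecutive values above `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False

# Hypothetical per-minute % I/O wait readings from one database slave.
io_wait = [12, 35, 42, 45, 41, 44, 30]

print(sustained_above(io_wait, threshold=40, min_run=3))
```

A single spike resets nothing and triggers nothing; only a run of `min_run` samples counts as "sustained".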
35,000 photo requests per second on a Tuesday peak
Safety Factors
Ceiling * Factor of Safety = UR LIMITZ
Safety Factors
webserver!
Safety Factors
what you have left
safe ceiling @85% CPU
85% total CPU = ~76 busy apache procs
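The step from the 85% CPU safe ceiling to ~76 busy apache procs is the earlier CPU-per-proc ratio applied in reverse. A small sketch of that arithmetic, using the slide's 1.12 and 85% figures; the function and constant names are illustrative.

```python
CPU_PER_BUSY_PROC = 1.12  # from the slides: total CPU ~= 1.12 * busy apache procs
SAFE_CPU_PERCENT = 85.0   # from the slides: safe ceiling at 85% CPU

def safe_busy_procs(safe_cpu_percent, cpu_per_proc):
    """Busy-proc count that corresponds to the safe CPU ceiling."""
    return round(safe_cpu_percent / cpu_per_proc)

print(safe_busy_procs(SAFE_CPU_PERCENT, CPU_PER_BUSY_PROC))  # 76
```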
Safety Factors
Yahoo Front Page link to Chinese New Year Photos
(8% spike)
(photo requests/second)
Forecasting
Fictional Example: webservers
Forecasting
peak of the week
Fictional example: 15 webservers. 1 week.
Forecasting
...bigger sample, 6 weeks....isolate the peaks...
Forecasting
not too shabby
now
...Add a Trendline with some decent correlation...
Forecasting
ceiling: when is this? (the trendline will tell you when it is)
what you have left
15 servers @76 busy apache proc limit = 1140 total procs
Forecasting
(1140-726) / 42.751 = 9.68
(week #10, duh)
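The arithmetic above can be sketched as a tiny function. It reads 726 as the trendline's intercept and 42.751 as its weekly slope, which is how the slide's numbers appear to fit together; the slides do not label them, so treat the parameter names as assumptions.

```python
def week_limit_reached(limit, intercept, slope_per_week):
    """Solve intercept + slope * week = limit for week."""
    if slope_per_week <= 0:
        raise ValueError("no growth, so the limit is never reached")
    return (limit - intercept) / slope_per_week

servers = 15
procs_per_server = 76                # safe busy apache procs per box
limit = servers * procs_per_server   # 1140 total procs

week = week_limit_reached(limit, intercept=726, slope_per_week=42.751)
print(f"limit of {limit} procs reached at week {week:.2f}")
```

Rounding up gives week 10, matching the slide.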
Forecasting Automation
Writing Excel macros is boring
All we want is days remaining, so all we need is the curve-fit
Use http://fityk.sf.net to automate the curve-fit
Forecasting
Fictional Example: storage consumption
this will tell you when this is
actual flickr storage consumption from early 2005, in GB (ceiling is fictional)
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...
cmd line script output
fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84)
Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)
(SAME)
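Once the curve-fit is in hand, "when do we hit the ceiling" is the positive root of the quadratic. A rough sketch using the fityk coefficients above; the 20,000 GB ceiling is a made-up value (the slides say the ceiling is fictional but do not give it), and x is in whatever time unit the data file uses.

```python
import math

# Coefficients from the fityk fit: y = A*x^2 + B*x + C (y in GB).
A, B, C = 0.786854, 146.657, 14147.4

def x_at_ceiling(ceiling, a=A, b=B, c=C):
    """Positive x where a*x^2 + b*x + c equals `ceiling`."""
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        raise ValueError("curve never reaches the ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)

HYPOTHETICAL_CEILING_GB = 20000.0  # assumption, not from the slides

x = x_at_ceiling(HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached at x = {x:.1f}")
```

Subtracting the current x from the result gives the time remaining, which is the only number the forecast really needs to produce.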
Capacity Health
12,629 Nagios checks
1,314 hosts
6 datacenters
4 photo farms
farm = 2 DCs (east/west)
High and Low Water Marks

alert if higher
alert if lower
Per server, squid requests per second
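A high/low water mark check is just a band around the expected value: too high means you are near a ceiling, too low usually means something upstream is broken. A minimal sketch with hypothetical thresholds and function name; the slides do not give the actual marks.

```python
def check_water_marks(value, low, high):
    """Classify a metric reading against low and high water marks."""
    if value > high:
        return "ALERT: above high water mark"
    if value < low:
        return "ALERT: below low water mark"
    return "OK"

# Hypothetical marks for per-server squid requests per second.
LOW_MARK, HIGH_MARK = 100, 900

for reading in (50, 400, 950):
    print(reading, check_water_marks(reading, LOW_MARK, HIGH_MARK))
```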
A good dashboard looks something like...

type      boxes  limit/box  ceiling units  limit (total)  current (peak)  % peak   est. days left
www          20         80  busy procs             1,600           1,000  62.50%               36
shard db     20         40  % I/O wait               800             220  27.50%              120
squid        18        950  req/sec               17,100          11,400  66.67%               48
(yes, fictional numbers)
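The "limit (total)" and "% peak" columns follow directly from the per-box limit and the number of boxes. A short sketch that recomputes them from the fictional numbers, reading the dashboard as 20 www boxes, 20 shard db boxes and 18 squid boxes; that reading and the function name are assumptions.

```python
rows = [
    # (type, boxes, limit per box, units, current peak)
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "% I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]

def summarize(boxes, limit_per_box, current_peak):
    """Return (total limit, current peak as a percentage of that limit)."""
    total = boxes * limit_per_box
    return total, round(100.0 * current_peak / total, 2)

for name, boxes, per_box, units, peak in rows:
    total, pct = summarize(boxes, per_box, peak)
    print(f"{name:<9} {total:>7,} {units:<11} {peak:>7,} {pct:6.2f}%")
```

The "est. days left" column would come from a curve-fit of each peak series against its total limit, not from these static numbers.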
Diagonal Scaling
vertically scaling your already horizontal nodes
Image processing machines
Replace Dell PE860s with HP DL140G3s
Diagonal Scaling
example: image processing
4 cores
8 cores
(about the same CPU usage per box)
Diagonal Scaling
example: image processing throughput
~45 images/min @ peak
~140 images/min @ peak

(same CPU usage, but ~3x more work)
processing means making 4 sizes from originals
Diagonal Scaling
example: image processing
went from:
23 Dell PE860s
3008.4 Watts
1035 photos/min
23U rack
to:
8 HP DL140 G3s
1036.8 Watts
1120 photos/min
8U rack !!!
(75% faster, even)
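The before/after comparison is easier to judge as throughput per watt. A quick sketch of that arithmetic using the slide's figures; it assumes 8 HP boxes, which is what the 8U rack and 1120 photos/min at ~140 per box imply, and the dictionary keys and function names are illustrative.

```python
def total_throughput(boxes, images_per_min_per_box):
    """Fleet-wide images per minute at peak."""
    return boxes * images_per_min_per_box

def photos_per_min_per_watt(photos_per_min, watts):
    return photos_per_min / watts

old = {"boxes": 23, "per_box": 45, "watts": 3008.4}   # Dell PE860s
new = {"boxes": 8, "per_box": 140, "watts": 1036.8}   # HP DL140 G3s (box count assumed)

old_total = total_throughput(old["boxes"], old["per_box"])
new_total = total_throughput(new["boxes"], new["per_box"])
old_eff = photos_per_min_per_watt(old_total, old["watts"])
new_eff = photos_per_min_per_watt(new_total, new["watts"])

print(f"old: {old_total} photos/min, {old_eff:.2f} photos/min per watt")
print(f"new: {new_total} photos/min, {new_eff:.2f} photos/min per watt")
```

Roughly the same work gets done in about a third of the rack space and power, which is the point of scaling the individual nodes up while keeping the farm horizontal.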
3.52 terabytes will be consumed today (on a
2nd Order Effects (beware the wandering bottleneck)
running hot, so add more
2nd Order Effects (beware the wandering bottleneck)
now these run hot
running great now, so more traffic!
Stupid Capacity Tricks

quick and dirty management
DSH http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2

quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>

Turn Stuff OFF
Disable heavy-ish features of the site (on/off switches)
We have 195 different things to disable in case of emergency.

Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
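The on/off switches above amount to a set of named flags that application code checks before doing expensive work. A minimal in-memory sketch, not Flickr's actual mechanism; the class, method and feature names are hypothetical, and a real deployment would need the flags stored somewhere all servers can read.

```python
class FeatureSwitches:
    """Named on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]

    def _set(self, name, state):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = state

switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")  # emergency: shed the heavy feature

if not switches.is_enabled("video_uploads"):
    print("video uploads are temporarily disabled")
```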

Outages Happen
Host your outage/status/blog page in more than one datacenter.
Tell your users WTF is going on, they'll appreciate it.

Hit the Pause Button
Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and unbake) at will
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We're Hiring! flickr.com/jobs
Come see me!
questions?
