Capacity Management: For Web Operations

John Allspaw Operations Engineering

the book I'm writing

???

Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks

(with some Flickr statistics sprinkled in)

Things that can cause downtime


bugs (disguised as capacity problems)

edge cases (disguised as capacity problems)

security incidents
real capacity problems*
* (should be the last thing you need to worry about)

Capacity != Performance

Forget about performance for right now

Measure what you have right NOW


Don't count on it getting any better

Thank You HPC Industry!


Automated Stuff
Scalable Metric Collection/Display

a lot of great deployment and management tricks come from them, adopted by web ops

Good Measurement Tools


record and store metrics in/out
custom metrics
easily compare
lightweight-ish

Clouds need planning too


Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black-boxes: you may need to pay even more attention than before

Metrics
System Statistics

Application Level

Metrics
(photos processed per minute)

(average processing time per photo)

(apache requests)

(concurrent busy apache procs)

Metrics
App-level meets system-level

here, total CPU = ~1.12 * # busy apache procs
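The relationship on this slide can be written down as a small helper. This is only a sketch: the 1.12 ratio is the figure quoted on the slide, while the function names and the through-the-origin fit are assumptions added for illustration, not the tooling used at Flickr.

```python
# Ratio quoted on the slide: total CPU % ~= 1.12 * busy Apache procs.
CPU_PER_BUSY_PROC = 1.12


def estimated_cpu_percent(busy_procs, ratio=CPU_PER_BUSY_PROC):
    """Estimate total CPU % from the number of busy Apache processes."""
    return ratio * busy_procs


def fit_ratio(samples):
    """Least-squares slope through the origin for (busy_procs, cpu_percent) pairs.

    Hypothetical helper: one way to derive such a ratio from your own metrics.
    """
    numerator = sum(procs * cpu for procs, cpu in samples)
    denominator = sum(procs * procs for procs, _ in samples)
    return numerator / denominator
```

Feeding `fit_ratio` with pairs sampled from production graphs gives the app-level to system-level ratio for your own servers, which will almost certainly differ from 1.12.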

2400
photos per minute being uploaded right NOW (Tuesday

Ceilings

the most work your resources will allow before degradation or failure

Forget Benchmarking

Find your ceilings

what you have left

The End

Use real live production data to find ceilings

Production: it's like a lab, but bigger!

Like: database ceilings

replication lag: bad!

Ceilings

waiting on disk: sustained disk I/O wait above 40% creates too much slave lag*
*for us, YMMV

35,000

photo requests per second on a Tuesday peak

Safety Factors

Safety Factors

Ceiling * Factor of Safety = UR LIMITZ

Safety Factors

webserver!

Safety Factors
what you have left

safe ceiling @85% CPU

85% total CPU = ~76 busy apache procs
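The arithmetic behind this slide, as a minimal sketch. It assumes the ~1.12 CPU-per-busy-proc ratio from the earlier Metrics slide and the 85% safe ceiling shown here; the function name is made up for illustration.

```python
CPU_PER_BUSY_PROC = 1.12   # ratio from the Metrics slide
SAFE_CPU_PERCENT = 85.0    # safe ceiling from this slide


def safe_busy_procs(safe_cpu_percent=SAFE_CPU_PERCENT,
                    cpu_per_proc=CPU_PER_BUSY_PROC):
    """Convert a safe CPU ceiling into a per-server busy-proc limit."""
    return round(safe_cpu_percent / cpu_per_proc)


print(safe_busy_procs())  # 76
```

Expressing the limit in application terms (busy procs) rather than CPU % makes it easier to compare against the traffic metrics you already graph.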

Safety Factors
Yahoo Front Page link to Chinese New Year Photos

(8% spike)

(photo requests/second)

Forecasting

Forecasting

Fictional Example: webservers

Forecasting
peak of the week

Fictional example: 15 webservers. 1 week.

Forecasting

...bigger sample, 6 weeks....isolate the peaks...

Forecasting
not too shabby

now

...Add a Trendline with some decent correlation...

Forecasting
ceiling
this will tell you when it is
when is this?
what you have left

15 servers @76 busy apache proc limit = 1140 total procs

Forecasting

(1140-726) / 42.751 = 9.68

(week #10, duh)
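The trendline step can be sketched without a spreadsheet. The numbers in the usage line (15 servers, 76 procs each, current peak of 726, slope of 42.751 procs per week) are the fictional ones from the slides; `fit_line` and `weeks_remaining` are illustrative names, and a plain least-squares line is assumed to be what the spreadsheet trendline does.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept) for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_remaining(ceiling, current_peak, slope_per_week):
    """Weeks until the trendline reaches the ceiling at the fitted growth rate."""
    return (ceiling - current_peak) / slope_per_week


# Slide numbers: 15 servers * 76 busy procs = 1140 total procs.
print(round(weeks_remaining(15 * 76, 726, 42.751), 2))  # 9.68
```

In practice you would pass the weekly peaks to `fit_line` and feed the resulting slope into `weeks_remaining`.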

Forecasting Automation
Writing Excel macros is boring
All we want is days remaining, so all we need is the curve-fit

Use http://fityk.sf.net to automate the curve-fit

Forecasting

Fictional Example: storage consumption

Forecasting Automation

this will tell you when this is

actual Flickr storage consumption from early 2005, in GB (ceiling is fictional)

Forecasting Automation
[jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

cmd line script output

Forecasting Automation
fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84)
Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)

(SAME)
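Once the curve-fit is in hand, "when do we hit the ceiling" is just a quadratic solve. The sketch below uses the coefficients from the fityk output; the 20,000 GB ceiling in the usage line is a made-up value (the slides say the ceiling is fictional and do not give it), and x is in whatever time unit the data file uses.

```python
import math

# Coefficients from the fityk fit: y = A*x^2 + B*x + C (storage in GB).
A, B, C = 0.786854, 146.657, 14147.4


def storage_at(x):
    """Fitted storage consumption at time x."""
    return A * x * x + B * x + C


def x_at_ceiling(ceiling):
    """Time x at which the fitted curve reaches the ceiling.

    Solves A*x^2 + B*x + (C - ceiling) = 0 for the non-negative root.
    Returns 0.0 if the ceiling is already at or below the fitted start value.
    """
    if ceiling <= C:
        return 0.0
    discriminant = B * B - 4 * A * (C - ceiling)
    return (-B + math.sqrt(discriminant)) / (2 * A)


HYPOTHETICAL_CEILING_GB = 20000.0  # assumption for illustration only
print(x_at_ceiling(HYPOTHETICAL_CEILING_GB))
```

Subtracting the current x from the result gives the time remaining, which is the only number the dashboard needs.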

Capacity Health
12,629 Nagios checks
1314 hosts
6 datacenters
4 photo farms
farm = 2 DCs (east/west)

High and Low Water Marks


alert if higher

alert if lower

Per server, squid requests per second
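A water-mark check is only a few lines. This sketch assumes you already have a per-server requests-per-second value from your metrics system; the function name and the thresholds in the usage line are hypothetical, not the ones used in the graph.

```python
def check_water_marks(value, low, high):
    """Classify a metric against low and high water marks."""
    if value > high:
        return "HIGH"   # more load than the server should take
    if value < low:
        return "LOW"    # suspiciously idle: out of rotation, or broken?
    return "OK"


# Hypothetical thresholds for squid requests per second on one server.
print(check_water_marks(value=400, low=100, high=950))  # OK
```

The low mark matters as much as the high one: a server doing far less work than its peers is often the first visible sign of a problem.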

A good dashboard looks something like...


type      # boxes  limit/box  ceiling units  limit (total)  current (peak)  % peak   est days left
www       20       80         busy procs     1600           1000            62.50%   36
shard db  20       40         % I/O wait     800            220             27.50%   120
squid     18       950        req/sec        17,100         11,400          66.67%   48

(yes, fictional numbers)
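The derived dashboard columns are simple arithmetic on three inputs per pool. In this sketch the class and field names are invented, and it assumes the unlabeled first number in each row is the count of boxes in the pool, which is what makes the slide's totals add up (20 x 80 = 1600, 18 x 950 = 17,100).

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    boxes: int            # assumed meaning of the first number in each row
    limit_per_box: float  # safe ceiling per box, in the pool's own units
    current_peak: float   # current peak across the whole pool

    @property
    def total_limit(self):
        """Ceiling for the whole pool."""
        return self.boxes * self.limit_per_box

    @property
    def percent_of_limit(self):
        """Current peak as a percentage of the pool ceiling."""
        return 100.0 * self.current_peak / self.total_limit


for pool in (Pool("www", 20, 80, 1000),
             Pool("shard db", 20, 40, 220),
             Pool("squid", 18, 950, 11400)):
    print(pool.name, pool.total_limit, round(pool.percent_of_limit, 2))
```

The "est days left" column would come from a curve-fit of each pool's peaks over time rather than from these three inputs.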

Diagonal Scaling
vertically scaling your already horizontal nodes

Image processing machines
Replace Dell PE860s with HP DL140G3s

Diagonal Scaling
example: image processing

4 cores

8 cores

(about the same CPU usage per box)

Diagonal Scaling
example: image processing throughput

~45 images/min @ peak

~140 images/min @ peak


(same CPU usage, but ~3x more work)
processing means making 4 sizes from originals

Diagonal Scaling
example: image processing
went from:

23 Dell PE860s
3008.4 Watts
1035 photos/min
23U rack

to:

8 HP DL140 G3s
1036.8 Watts
1120 photos/min
8U rack
!!! (75% faster, even)
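The before/after figures are easier to compare as ratios. This sketch uses only the watts, photos/min and rack-unit numbers from the slide; the dictionary layout and function names are illustrative.

```python
before = {"watts": 3008.4, "photos_per_min": 1035, "rack_units": 23}  # Dell PE860s
after = {"watts": 1036.8, "photos_per_min": 1120, "rack_units": 8}    # HP DL140 G3s


def photos_per_watt(cluster):
    """Throughput per watt of power drawn."""
    return cluster["photos_per_min"] / cluster["watts"]


def photos_per_rack_unit(cluster):
    """Throughput per rack unit occupied."""
    return cluster["photos_per_min"] / cluster["rack_units"]


for label, cluster in (("before", before), ("after", after)):
    print(label,
          round(photos_per_watt(cluster), 3),
          round(photos_per_rack_unit(cluster), 1))
```

Both ratios improve by roughly a factor of three, which is the point of diagonal scaling: fewer, bigger boxes doing the same horizontally scaled job.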

3.52

terabytes will be consumed today (on a

2nd Order Effects (beware the wandering bottleneck)

running hot, so add more

2nd Order Effects (beware the wandering bottleneck)

now these run hot

running great now, so more traffic!

Stupid Capacity Tricks

Stupid Capacity Tricks


quick and dirty management
DSH http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2

Stupid Capacity Tricks


quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>

Stupid Capacity Tricks


Turn Stuff OFF

Disable heavy-ish features of the site (on/off switches)

We have 195 different things to disable in case of emergency.

Stupid Capacity Tricks


Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
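An on/off switch can be as simple as a named boolean that the application checks before doing expensive work. The sketch below is a generic in-memory version with invented names; the slides do not describe how Flickr's switches are actually implemented or stored.

```python
class FeatureSwitches:
    """Named on/off switches; everything starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._require_known(name)
        self._enabled[name] = False

    def enable(self, name):
        self._require_known(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        self._require_known(name)
        return self._enabled[name]

    def _require_known(self, name):
        # Fail loudly on typos rather than silently creating a new switch.
        if name not in self._enabled:
            raise KeyError(f"unknown feature switch: {name}")


# Hypothetical switch names modeled on the list above.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")
print(switches.is_enabled("video_uploads"))  # False
```

In a real deployment the switch state would live somewhere every web server can read quickly, so that flipping one takes effect across the site without a code push.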

Stupid Capacity Tricks


Outages Happen
Host your outage/status/blog page in more than one datacenter.

Tell your users WTF is going on, they'll appreciate it.

Stupid Capacity Tricks


Hit the Pause Button

Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and unbake) at will

thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

We're Hiring! flickr.com/jobs

Come see me!

questions?
