Capacity Management for Web Operations

Published on June 25, 2008

Author: jallspaw

Source: slideshare.net

Capacity Management for Web Operations. John Allspaw, Operations Engineering

the book I’m writing

???

Rules of Thumb; Planning/Forecasting; Stupid Capacity Tricks (with some Flickr statistics sprinkled in)

Things that can cause downtime: bugs (disguised as capacity problems); edge cases (disguised as capacity problems); security incidents; real capacity problems* (*should be the last thing you need to worry about)

Capacity != Performance. Forget about performance for right now. Measure what you have right NOW. Don’t count on it getting any better.

Thank You HPC Industry! Automated Stuff. Scalable Metric Collection/Display. A lot of great deployment and management tricks come from them, adopted by web ops.

Good Measurement Tools: record and store metrics; custom metrics in/out; easily compare; lightweight-ish

Clouds need planning too. They make deployment and procurement easy and quick, but clouds are still resources with costs and limits, just like your own stuff. Black boxes: you may need to pay even more attention than before.

Metrics: System Statistics

Metrics: “Application” Level: (photos processed per minute), (average processing time per photo), (apache requests), (concurrent busy apache procs)

Metrics: App-level meets system-level. Here, total CPU = ~1.12 * # busy apache procs (ymmv)

2400 photos per minute being uploaded right NOW (Tuesday afternoon)

Ceilings: the maximum amount of “work” your resources will allow before degradation or failure

Forget Benchmarking

Find your ceilings (graph labels: “what you have left”, “The End”)

Use real live production data to find ceilings. Production: “it’s like a lab, but bigger!”

Like: database ceilings. Replication lag: bad!

Ceilings: waiting on disk. Sustained disk I/O wait that is too high (>40%) creates slave lag* (*for us, YMMV)
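The ">40% sustained" rule above can be checked mechanically. This is a minimal sketch, assuming I/O wait is sampled at a fixed interval; the function name, the sample values, and the choice of five consecutive samples as "sustained" are illustrative assumptions, not from the talk.

```python
def sustained_above(samples, threshold, min_consecutive):
    """Return True if `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for sample in samples:
        # Extend the current run, or reset it when a sample drops back down.
        run = run + 1 if sample > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical I/O wait percentages, one sample per polling interval.
spiky = [12, 55, 18, 61, 20, 47, 15]
sustained = [12, 41, 44, 52, 48, 45, 30]

print(sustained_above(spiky, threshold=40, min_consecutive=5))
print(sustained_above(sustained, threshold=40, min_consecutive=5))
```

Requiring a run of samples rather than a single reading keeps short spikes from being treated as a ceiling.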

35,000 photo requests per second on a Tuesday peak

Safety Factors

Safety Factors: Ceiling * Factor of Safety = UR LIMITZ

Safety Factors: webserver!

Safety Factors: “safe” ceiling @ 85% CPU; 85% total CPU = ~76 busy apache procs (graph label: “what you have left”)
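The arithmetic behind the 85% / ~76 procs figures can be sketched as follows. It assumes the ceiling is 100% total CPU with a 0.85 factor of safety, and that CPU grows linearly with busy procs at the ~1.12 ratio quoted earlier; the function and constant names are illustrative.

```python
# Assumption: ~1.12% total CPU per busy Apache proc, the ratio observed
# on the metrics slide ("ymmv").
CPU_PERCENT_PER_BUSY_PROC = 1.12


def safe_limit(ceiling, safety_factor):
    """Ceiling * factor of safety = the limit to plan against."""
    return ceiling * safety_factor


def busy_procs_at_cpu(cpu_percent, ratio=CPU_PERCENT_PER_BUSY_PROC):
    """Convert a total-CPU percentage into an equivalent busy-proc count."""
    return cpu_percent / ratio


safe_cpu = safe_limit(100.0, 0.85)               # "safe" ceiling in % CPU
safe_procs = round(busy_procs_at_cpu(safe_cpu))  # per-server proc limit

print(safe_cpu, safe_procs)
```

Expressing the limit in busy procs rather than CPU ties the system-level ceiling to an application-level number that is easy to graph and alert on.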

Safety Factors: Yahoo Front Page link to Chinese New Year Photos (8% spike) (photo requests/second)

Forecasting

Forecasting: Fictional Example: webservers

Forecasting: peak of the week. Fictional example: 15 webservers. 1 week.

Forecasting ...bigger sample, 6 weeks....isolate the peaks...

Forecasting: not too shabby now ...“Add a Trendline” with some decent correlation...

Forecasting: 15 servers @ 76 busy apache proc limit = 1140 total procs. The trendline will tell you when that ceiling is reached. (graph labels: “when is this?”, “what you have left”)

Forecasting: (1140-726) / 42.751 = 9.68 (week #10, duh)
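The same calculation can be sketched in code. This reads 726 and 42.751 as the intercept and slope of the trendline (an assumption; the slide does not label them), and includes a plain least-squares fit for weekly peaks. The helper names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (x, y); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until(ceiling, slope, intercept):
    """Solve ceiling = slope * week + intercept for week."""
    return (ceiling - intercept) / slope


ceiling = 15 * 76  # 15 servers at a 76 busy-proc limit each = 1140
weeks = weeks_until(ceiling, slope=42.751, intercept=726)
print(round(weeks, 2))
```

With real data, `fit_line` would take the week numbers and the isolated weekly peaks, and its result would feed `weeks_until`.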

Forecasting Automation: Writing excel macros is boring. All we want is “days remaining”, so all we need is the curve-fit. Use http://fityk.sf.net to automate the curve-fit.

Forecasting: Fictional Example: storage consumption

Forecasting Automation: this is actual flickr storage consumption from early 2005, in GB (ceiling is fictional). The curve-fit will tell you when the ceiling is reached.

Forecasting Automation (slide labels: cmd line script, output)
jallspaw:~]$ cfityk ./fit-storage.fit
1> # Fityk script. Fityk version: 0.8.2
2> @0 < '/home/jallspaw/storage-consumption.xy'
15 points. No explicit std. dev. Set as sqrt(y)
3> guess Quadratic
New function %_1 was created.
4> fit
Initial values: lambda=0.001 WSSR=464.564
#1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
#2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
#3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
#4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
Fit converged.
Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
5> info formula in @0
# storage-consumption
14147.4+146.657*x+0.786854*x^2
6> quit
bye...

Forecasting Automation: fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84). Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84). (SAME)
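Once the quadratic is known, "when do we hit the ceiling" is the positive root of the fitted curve minus the ceiling. The sketch below uses the fityk coefficients from the slide; the ceiling value is made up for illustration (the talk's ceiling is fictional and not stated), and x is in whatever unit the .xy file uses, which the slide does not say.

```python
import math

# Coefficients reported by fityk: y = A*x^2 + B*x + C, with y in GB.
A, B, C = 0.786854, 146.657, 14147.4


def storage_gb(x, a=A, b=B, c=C):
    """Evaluate the fitted storage curve at x."""
    return a * x * x + b * x + c


def x_at_ceiling(ceiling_gb, a=A, b=B, c=C):
    """Larger root of a*x^2 + b*x + (c - ceiling_gb) = 0."""
    disc = b * b - 4 * a * (c - ceiling_gb)
    if disc < 0:
        raise ValueError("the fitted curve never reaches this ceiling")
    return (-b + math.sqrt(disc)) / (2 * a)


HYPOTHETICAL_CEILING_GB = 20000.0  # illustrative value, not from the talk
x_hit = x_at_ceiling(HYPOTHETICAL_CEILING_GB)
print(round(x_hit, 1), round(storage_gb(x_hit), 1))
```

Subtracting the current x from `x_hit` gives the "remaining" figure the talk is after.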

Capacity Health: 12,629 nagios checks; 1314 hosts; 6 datacenters; 4 photo “farms” (farm = 2 DCs, east/west)

High and Low Water Marks: alert if higher; alert if lower. Per server, squid requests per second.
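A water-mark check is a two-sided threshold: too much traffic suggests a server near its ceiling, too little suggests it has dropped out of rotation. A minimal sketch, with hypothetical host names, readings, and thresholds:

```python
def check_water_marks(value, low, high):
    """Classify a reading against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"


# Hypothetical per-server squid requests/second and thresholds.
LOW_MARK, HIGH_MARK = 200, 950
readings = {"squid1": 610, "squid2": 35, "squid3": 1020}

alerts = {
    host: check_water_marks(rps, LOW_MARK, HIGH_MARK)
    for host, rps in readings.items()
}
print(alerts)
```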

A good dashboard looks something like...

type      #   limit/box  ceiling units  est. limit (total)  current (peak)  % peak  days left
www       20  80         busy procs     1600                1000            62.50%  36
shard db  20  40         I/O wait       800                 220             27.50%  120
squid     18  950        req/sec        17,100              11,400          66.67%  48

(yes, fictional numbers)
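The total-limit and percent-of-peak columns follow directly from the box count, the per-box limit, and the current peak. The sketch below recomputes them from the slide's fictional numbers; "days left" is omitted because it depends on a growth rate the slide does not give. The row layout and function name are illustrative.

```python
# (type, boxes, limit per box, ceiling units, current peak), taken from
# the slide's fictional dashboard.
ROWS = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]


def summarize(rows):
    """Return (type, estimated total limit, percent of limit used at peak)."""
    summary = []
    for kind, boxes, per_box, _units, peak in rows:
        total = boxes * per_box
        pct = round(100.0 * peak / total, 2)
        summary.append((kind, total, pct))
    return summary


for kind, total, pct in summarize(ROWS):
    print(f"{kind:<9} limit={total:>6} peak%={pct:.2f}")
```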

Diagonal Scaling: vertically scaling your already horizontal nodes. Image processing machines: replace Dell PE860s with HP DL140G3s.

Diagonal Scaling example: image processing. 4 cores -> 8 cores (about the same CPU “usage” per box)

Diagonal Scaling example: image processing throughput. ~45 images/min @ peak -> ~140 images/min @ peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from originals.

Diagonal Scaling example: image processing
went from: 23 Dell PE860s    3008.4 Watts  1035 photos/min  23U rack
to:        8 HP DL140 G3s    1036.8 Watts  1120 photos/min  8U rack !!! (75% faster, even)
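The before/after figures can be turned into per-box and per-watt numbers to make the trade explicit. This sketch only uses the values on the slide; the dictionary layout and function names are illustrative.

```python
BEFORE = {"boxes": 23, "watts": 3008.4, "photos_per_min": 1035, "rack_units": 23}
AFTER = {"boxes": 8, "watts": 1036.8, "photos_per_min": 1120, "rack_units": 8}


def per_box_throughput(fleet):
    """Photos per minute handled by one box."""
    return fleet["photos_per_min"] / fleet["boxes"]


def watts_per_photo_per_min(fleet):
    """Power spent per unit of throughput (lower is better)."""
    return fleet["watts"] / fleet["photos_per_min"]


def percent_saved(before, after, key):
    """Percentage reduction in `key` going from before to after."""
    return 100.0 * (before[key] - after[key]) / before[key]


for name, fleet in (("before", BEFORE), ("after", AFTER)):
    print(name, per_box_throughput(fleet), round(watts_per_photo_per_min(fleet), 2))

print(round(percent_saved(BEFORE, AFTER, "watts"), 1))
print(round(percent_saved(BEFORE, AFTER, "rack_units"), 1))
```

The per-box throughput matches the ~45 and ~140 images/min figures quoted for the two machine types.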

3.52 terabytes will be consumed today (on a Tuesday)

2nd Order Effects (beware the wandering bottleneck): LB running hot, so add more www. (diagram: www, www; db, search, memcached)

2nd Order Effects (beware the wandering bottleneck): LB running great now, so more traffic! Now these run hot: db, search, memcached. (diagram: www, www, www, www; db, search, memcached)

Stupid Capacity Tricks

Stupid Capacity Tricks: quick and dirty management. DSH: http://freshmeat.net/projects/dsh
[root@netmon101 ~]# cat group.of.servers
www100
www118
dbcontacts3
admin1
admin2

Stupid Capacity Tricks: quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
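Output in this "host: result" shape is easy to post-process. As one illustration, the sketch below parses the `date` output shown above and reports hosts whose timezone differs from the majority (here, dbcontacts3 reports PDT while the rest report UTC). The parsing assumes the six-field `date` format in that output; the function names are hypothetical and this is not part of dsh itself.

```python
from collections import Counter

DSH_OUTPUT = """\
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
"""


def timezone_by_host(dsh_output):
    """Map each host to the timezone field of its `date` output."""
    zones = {}
    for line in dsh_output.splitlines():
        host, sep, date_text = line.partition(": ")
        fields = date_text.split()
        # Expect: weekday, month, day, time, timezone, year.
        if sep and len(fields) == 6:
            zones[host] = fields[4]
    return zones


def timezone_outliers(dsh_output):
    """Hosts whose timezone differs from the most common one."""
    zones = timezone_by_host(dsh_output)
    if not zones:
        return []
    majority, _count = Counter(zones.values()).most_common(1)[0]
    return sorted(host for host, zone in zones.items() if zone != majority)


print(timezone_outliers(DSH_OUTPUT))
```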

Stupid Capacity Tricks: Turn Stuff OFF. Disable heavy-ish features of the site (on/off switches). We have 195 different things to disable in case of emergency.

Stupid Capacity Tricks: Turn Stuff OFF: uploads (photo); uploads (video); uploads by email; various API things; various mobile things; various search things; etc., etc.
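On/off switches of this kind can be as simple as a named boolean checked at the top of each heavy code path. The sketch below is one possible shape, not a description of Flickr's implementation; the class, switch names, and handler are hypothetical, and a real system would keep the state somewhere shared so every web server sees a change.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def disable(self, name):
        self._set(name, False)

    def enable(self, name):
        self._set(name, True)

    def is_enabled(self, name):
        return self._enabled[name]  # KeyError for an unknown switch

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = value


def handle_video_upload(switches):
    """Hypothetical request handler guarded by a switch."""
    if not switches.is_enabled("uploads_video"):
        return "video uploads are temporarily disabled"
    return "accepted"


switches = FeatureSwitches(["uploads_photo", "uploads_video", "uploads_by_email"])
switches.disable("uploads_video")
print(handle_video_upload(switches))
```

Rejecting unknown switch names catches typos before an emergency, when a silently ignored switch would be most costly.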

Stupid Capacity Tricks: Outages Happen. Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they’ll appreciate it.

Stupid Capacity Tricks: Hit the Pause Button. Bake the dynamic into static. Some Y! properties have a big red button to instantly bake (and un-bake) at will.

thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/

We’re Hiring! flickr.com/jobs Come see me!

questions?
