10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

Published on March 6, 2014

Author: BrianTroutwine1

Source: slideshare.net

Description

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discuss the instrumentation-by-default approach taken by the AdRoll real-time bidding team, the technical details of the libraries we use, and the lessons learned in adapting an organization to deal with the onslaught of data from instrumentation.

TEN BILLION A DAY, ONE-HUNDRED MILLISECONDS PER: MONITORING REAL-TIME BIDDING AT ADROLL

I DO THINGS WITH/TO COMPUTERS.

I CARE ABOUT RELIABLE, COMPLEX AND CRITICAL SYSTEMS.

ADROLL

LESS THIS

MORE THIS

WE’RE AN ADTECH COMPANY.

RETARGETING, IT’S ALL ABOUT DATA.

• Our customers want to show ads to people who might care to see them.
• We partner with “exchanges” to enter ad-slot auctions.
• These auctions are executed and finalized while consumers’ webpages load.

REAL-TIME BIDDING

The nature of the problem domain:
• Low latency (< 100ms per transaction)
• Firm real-time system
• Highly concurrent (> 30 billion transactions per day)
• Global, 24/7 operation

“HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS(…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.”
– CARLOS BUENO, “MATURE OPTIMIZATION HANDBOOK”

AHEAD OF TIME VERIFICATION IS NOT SUFFICIENT.
(DON’T SCRIMP ON IT, THOUGH.)

IGNORANCE AND COMPLEX INTERACTIONS WITH EXTERNAL SYSTEMS ARE WHY WE CAN’T HAVE NICE THINGS.

AT SCALE, EVEN RARE EVENTS HAPPEN FREQUENTLY.

AT SCALE, BAD THINGS HAPPEN FASTER THAN HUMANS CAN RESPOND.

THE RESULTS ARE RARELY PRETTY.

THE CAUSES ARE RARELY QUICK TO DISCOVER.

WHAT CAN BE DONE?

WE HAVE TO OBSERVE OUR SYSTEMS WHILE THEY RUN.

WE CAN BE SCIENTIFIC ABOUT THIS

NOT ALL SYSTEMS ARE MONITORABLE, JUST AS NOT ALL ARE TESTABLE.
IT’S A MATTER OF DESIGN AND OF CULTURE.

A DESIGN THAT IS MONITORABLE TAKES STOCK OF ITS INTERNAL TOLERANCES, EXTERNAL INTERFACES AND EXPOSES THESE FOR AN OPERATOR.

AN ENGINEER IS RESPONSIBLE FOR DESIGNING AND BUILDING SYSTEMS.
AN OPERATOR IS RESPONSIBLE FOR RUNNING THEM.

AHEAD-OF-TIME VERIFICATION (TESTING, TYPE-CHECKING) GIVES THE ENGINEER INTUITION ABOUT THE SYSTEM.

IF YOU DON’T KNOW HOW THE SYSTEM SHOULD BEHAVE YOU CAN’T SAY HOW IT SHOULDN’T OR ISN’T.

MONITORING GIVES THE OPERATOR INFORMATION ABOUT THE BEHAVIOR OF THE RUNNING SYSTEM.

THE OPERATORS AND ENGINEERS ARE OFTEN THE SAME PEOPLE.

THERE’S A POTENTIAL FOR A POSITIVE FEEDBACK LOOP OF QUALITY HERE.

“WHILE THE SKILL OF REENTRY WAS EASILY HANDLED BY AUTOMATED SYSTEMS, THE PILOT’S PRIMARY FUNCTION EVOLVED TO BE A REDUNDANT SYSTEM (…) COORDINATING A VARIETY OF CONTROLS AS MUCH AS DIRECTLY CONTROLLING THE VEHICLE.”
– DAVID A. MINDELL, DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT

exometer

exometer: github.com/Feuerlabs/exometer
• Created by Feuerlabs.
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.

IMPORTANT TERMS
• METRIC: a measurement
• ENTRY: a receiver and aggregator of metrics
• REPORTER: an entity which samples entries on a regular interval and optionally ships these samples on to a third system
• SUBSCRIPTION: the definition of the regular interval on which reporters sample entries

Defining Entries

    {predefined, [
        {[erlang, memory],
         {function, erlang, memory, ['$dp'], value, [ets, binary]},
         []},
        {[erlang, gc],
         {function, erlang, statistics, [garbage_collection],
          match, {total_coll, rec_wrd, '_'}},
         []},

        {[erlang, statistics],
         {function, erlang, statistics, ['$dp'], value, [run_queue]},
         []},
        {[boodah, freq_cap, not_found], spiral},
        {[boodah, freq_cap, ok], spiral},
        {[boodah, freq_cap, timeout], spiral}
    ]},
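The function entries above wrap ordinary Erlang BIFs. As a rough sketch of what they read (assuming that exometer substitutes each data point for '$dp' in turn), the same values can be fetched directly. The module name vm_probe is hypothetical and is not part of exometer or of the talk.

```erlang
%% Hypothetical helper module, for illustration only.
-module(vm_probe).
-export([sample/0]).

%% Read the same raw VM values that the function entries point at.
sample() ->
    %% erlang:memory/1 with a single memory type returns a byte count.
    Ets = erlang:memory(ets),
    Binary = erlang:memory(binary),
    %% Returns {NumberOfGCs, WordsReclaimed, 0}; the 'match' entry
    %% names the first two elements and ignores the third.
    {NumGCs, WordsReclaimed, _} = erlang:statistics(garbage_collection),
    %% Total length of all normal run queues.
    RunQueue = erlang:statistics(run_queue),
    [{ets, Ets},
     {binary, Binary},
     {total_coll, NumGCs},
     {rec_wrd, WordsReclaimed},
     {run_queue, RunQueue}].
```

Calling vm_probe:sample() in a shell returns a property list with one integer per data point, which is roughly what a reporter would see at each sampling interval.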

Defining Reporters

    {reporters, [
        {exometer_report_statsd, [
            {hostname, "localhost"},
            {port, 8125},
            {type_map, [
                {[erlang, statistics, run_queue], histogram},
                {[erlang, gc, tot_coll], histogram},
                {[erlang, gc, rec_wrd], histogram},

                {[erlang, memory, ets], gauge},
                {[erlang, memory, binary], gauge},

                {[boodah, freq_cap, not_found], gauge},
                {[boodah, freq_cap, ok], gauge},
                {[boodah, freq_cap, timeout], gauge}
            ]},

Defining Subscriptions

    {report, [
        {subscribers, [
            {exometer_report_statsd, [erlang, statistics], run_queue, 1000},

            {exometer_report_statsd, [erlang, gc], tot_coll, 1000},
            {exometer_report_statsd, [erlang, gc], rec_wrd, 1000},

            {exometer_report_statsd, [erlang, memory], ets, 10000},
            {exometer_report_statsd, [erlang, memory], binary, 10000},

            {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000},
            {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000},
            {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000}
        ]}
    ]}

Doing it dynamically.

    1> exometer:new([a, histogram], histogram).
    ok

    2> exometer:get_value([a, histogram]).
    {ok,[{n,0},
         {mean,0},
         {min,0},
         {max,0},
         {median,0},
         {50,0},
         {75,0},
         {90,0},
         {95,0},
         {99,0},
         {999,0}]}

    3> exometer_report:add_reporter(exometer_report_tty, []).
    ok

    4> exometer_report:subscribe(exometer_report_tty, [a, histogram], mean, 1000, []).
    ok

    exometer_report_tty: a_histogram_mean 1393627070:0
    exometer_report_tty: a_histogram_mean 1393627071:0
    exometer_report_tty: a_histogram_mean 1393627072:0

WHAT ARE WE LOOKING FOR?

• VM killers
• System performance regressions
• Abnormal system behavior
• Surprises

“ABNORMAL SYSTEM BEHAVIOR”?

CASE STUDIES

CASE STUDY: ADROLL RTB TIMEOUT SPIKES

PERIODIC BID TIMEOUTS

CONSISTENT SYSTEM LOAD

CORRELATED NETWORK TRAFFIC SPIKES

CORRELATED RUN QUEUE SPIKES

WHAT HAPPENED?

• Scheduler threads were locked to CPUs
• Background process comes on every 20 minutes, consumes a lot of CPU time
• No cpu-shield was set up on our production systems
• OS bumped a scheduler thread off its CPU, backing up its run-queue
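The run-queue signal in this case study comes from a value the VM already exposes. A minimal sketch of sampling it by hand and picking out spikes follows; the module name runq_watch, the function names, and the threshold are all hypothetical, and in the setup described in the talk exometer does the sampling instead.

```erlang
%% Hypothetical helper module, for illustration only.
-module(runq_watch).
-export([sample/2, spikes/2]).

%% Take N samples of the total run-queue length, IntervalMs apart.
%% Each sample is {MonotonicTimeMs, RunQueueLength}.
sample(N, IntervalMs) when is_integer(N), N > 0 ->
    [begin
         timer:sleep(IntervalMs),
         {erlang:monotonic_time(millisecond),
          erlang:statistics(run_queue)}
     end || _ <- lists:seq(1, N)].

%% Keep only the samples whose run-queue length exceeds Threshold.
spikes(Samples, Threshold) ->
    [S || {_Time, Len} = S <- Samples, Len > Threshold].
```

For example, runq_watch:spikes(runq_watch:sample(60, 1000), 50) would collect one minute of samples and return those above a (made-up) threshold of 50. A backed-up scheduler like the one described above would show up as a periodic cluster of such samples.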

CASE STUDY: EXCHANGE THROTTLING

HEALTHY PATTERN OF BID REQUESTS

THE TROUGH OF THROTTLING

BAD GOOD

PROBLEM CONFIRMED WITH EXCHANGE

• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes had been made to the running systems via deployment.
• Amazon data showed no network issues to our machines.

WHAT HAPPENED?

WE HIT AN IMPLICIT EXCHANGE LIMIT. (ARGUABLY, A BUG.)

LESSONS LEARNED

IT IS POSSIBLE TO HAVE TOO LITTLE INFORMATION.

“(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF REACTOR 4). THEY KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.”
– SVETLANA ALEXIEVICH, VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A NUCLEAR DISASTER

IT IS POSSIBLE TO COLLECT TOO MUCH INFORMATION, OR PRESENT IT BADLY.

“SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. (…) ONE OF THE LESSONS OF COMPLEX SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.”
– CHARLES PERROW, NORMAL ACCIDENTS: LIVING WITH HIGH-RISK TECHNOLOGIES

INDIRECT KNOWLEDGE MAY NOT TELL THE WHOLE STORY, OR MAY MAKE YOU DOUBT WHAT’S PLAINLY BEFORE YOUR EYES.

“THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. ONE REASON FOR THEIR IGNORANCE WAS THE IMPERFECT NATURE OF TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD BLOWN UP.”
– HENRY S. F. COOPER, JR., XIII: THE APOLLO FLIGHT THAT FAILED

Things that don’t quite work like you’d hope.

Problems
• Instrumenting code increases code size.
• While very low, runtime impact is not zero.
• Instrumentation of dependencies is up to the library author.

Solutions
• DTrace / SystemTap hookups
• Culture of instrumentation by default
• Tracing BIFs

WOMBAT?

Instrumentation by default strategies
• Application configuration callback module
• Behaviour callbacks
• Value-at-time functions
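One way to read the first strategy is to route every measurement through a callback module chosen in application configuration. The sketch below is an illustration under that assumption and is not AdRoll's code: the module timed_call, the application name my_app, the key metrics_sink, and the notify/2 callback are all hypothetical.

```erlang
%% Hypothetical wrapper module, for illustration only.
-module(timed_call).
-export([run/2, run/3, notify/2]).

%% A metrics sink is any module that exports notify/2.
-callback notify(Name :: [atom()], Micros :: non_neg_integer()) -> ok.

%% Time Fun and report the duration to the configured sink.
%% The sink module is looked up in application config; if nothing
%% is configured, this module's own no-op notify/2 is used.
run(Name, Fun) ->
    Sink = application:get_env(my_app, metrics_sink, ?MODULE),
    run(Sink, Name, Fun).

%% Same as run/2 but with an explicit sink module.
run(Sink, Name, Fun) when is_atom(Sink), is_function(Fun, 0) ->
    {Micros, Result} = timer:tc(Fun),
    ok = Sink:notify(Name, Micros),
    Result.

%% Default sink: discard the measurement.
notify(_Name, _Micros) ->
    ok.
```

A call site then looks like timed_call:run([boodah, freq_cap, lookup], fun() -> do_lookup() end), where do_lookup/0 stands in for the real work. Because the wrapper returns the result of Fun unchanged, adding it does not alter the caller's behaviour, which keeps instrumenting by default cheap.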

WHAT ABOUT CULTURE

LOOK FOR SOLUTIONS, NOT SCAPEGOATS.

DON’T BE FOOLED BY OVERCONFIDENCE IN YOUR WORK.

DOCUMENT AND DISCUSS YOUR FAILURES.

EVERYONE MUST WORK TOWARD THE SAME END.

KNOW SOME BASIC MATHEMATICS.

PREFER LOOSE COUPLING WHEN POSSIBLE.
BE EXPLICIT ABOUT TIGHT COUPLING.

MEASURE MUCH, PRESENT ONLY THAT WHICH YOU’RE CERTAIN TO NEED.

MEASURE, DON’T GUESS.

BE ACCURATE.

QUESTIONS?

<3 @bltroutwine BRIAN@TROUTWINE.US

Bonus Bibliography!
• Henry S. F. Cooper, Jr. XIII: The Apollo Flight that Failed
• David A. Mindell. Digital Apollo: Human and Machine in Spaceflight
• Charles Perrow. Normal Accidents: Living with High-Risk Technologies
• Svetlana Alexievich. Voices from Chernobyl: The Oral History of a Nuclear Disaster
• Hermann Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications
• Eric Schlosser. Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety
