Edge Architecture - IEEE International Conference on Cloud Engineering

Published on March 12, 2014

Author: MikeyCohen1

Source: slideshare.net

Netflix’s Global Cloud Edge Architecture
Mikey Cohen (mikey@netflix.com)
Edge Engineering Platform, Netflix

Over 44 million subscribers in over 40 countries

Netflix accounts for over 30% of peak internet traffic in North America

One billion hours of viewing per month, roughly 100,000 years

Netflix supports over 1000 device types

Edge Services
● Front door to Netflix
● Edge Routing - Zuul
● API - Edge Server
● Playback services

How does Netflix Streaming work? (A simplified view)

[Diagram labels, repeated on each slide below: Your CE Device, Netflix Services in Amazon Cloud, CDN]

Device Under the Hood
[Diagram adds: User Interface, Netflix Streaming Platform, DRM, encoding, CE integration]

User Interface loaded, data retrieved from Netflix Edge Service
[Diagram adds: Edge Services]

User Interface Loaded

Movie Authorization [diagram label: Authorize]

Obtaining License [diagram label: License]

Movie starts streaming [diagram label: PlayData]

Periodic “bookmark” calls note place in movie [diagram label: bookmark]

Edge Services - What we are talking about today

Edge’s lofty mission
● High Availability
● Good performance
● Data broker between many services and devices in a global, high volume, rapidly innovating, highly dynamic service
● Clients and services are constantly changing

Edge stats
● Billions of incoming requests per day
  ○ More than 10 outgoing service calls per incoming request
● About 10 device changes per day
● Daily service pushes
● Daily routing changes

Architecture Goals
● Infrastructure
  ○ Availability
  ○ Resiliency
  ○ Scalability
● Application
  ○ Platform diversity
  ○ Rapid innovation
  ○ A/B Testing
● Delivery
  ○ Automation
  ○ Insights

Netflix’s Global Cloud Architecture

High Level Regional Edge Architecture
[Diagram labels: ELB (x3), Zuul, Edge Service, Playback Service, Website Service, Netflix Services]

Zuul

What is Zuul?
● Open source framework for dynamically reading, writing, and executing filters that act on incoming HTTP requests
● Dynamically compiled filters written in Groovy
  ○ Any JVM language supported
● Filters share state through a request-scoped context

How we use Zuul
● Authentication
● Insights
● Stress Testing
● Canary Testing
● Dynamic Routing
● Service Migration
● Load Shedding
● Security
● Static Response handling
● Active/Active traffic management

Zuul Filter Characteristics
● Type
● Execution Order
● Criteria
● Action

Zuul Filter Lifecycle
[Diagram labels: HTTP Request, "pre" filters, "routing" filter(s), Origin Server, "post" filters, "error" filters, "custom" filters, HTTP Response]
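
The lifecycle and the four filter characteristics (type, execution order, criteria, action) can be sketched in plain Java without the Zuul library. This is only an illustration under stated assumptions: `SketchFilter`, `FilterChainSketch` and the map-based request context are hypothetical names and not the Zuul API, the point at which "error" filters run is an assumption, and "custom" filters are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical filter contract mirroring the four characteristics on the slide.
interface SketchFilter {
    String type();                                  // "pre", "routing", "post" or "error"
    int order();                                    // lower values run first within a type
    boolean shouldFilter(Map<String, Object> ctx);  // criteria
    void run(Map<String, Object> ctx);              // action
}

final class FilterChainSketch {
    private final List<SketchFilter> filters = new ArrayList<>();

    void register(SketchFilter f) { filters.add(f); }

    // Convenience factory so filters can be written as lambdas.
    static SketchFilter filter(String type, int order,
                               Predicate<Map<String, Object>> criteria,
                               Consumer<Map<String, Object>> action) {
        return new SketchFilter() {
            public String type() { return type; }
            public int order() { return order; }
            public boolean shouldFilter(Map<String, Object> ctx) { return criteria.test(ctx); }
            public void run(Map<String, Object> ctx) { action.accept(ctx); }
        };
    }

    // Runs every filter of one type, in execution order, if its criteria match.
    private void runType(String type, Map<String, Object> ctx) {
        filters.stream()
               .filter(f -> f.type().equals(type))
               .sorted(Comparator.comparingInt(SketchFilter::order))
               .filter(f -> f.shouldFilter(ctx))
               .forEach(f -> f.run(ctx));
    }

    // The request-scoped context is shared by all filters for one request.
    Map<String, Object> handle(Map<String, Object> ctx) {
        try {
            runType("pre", ctx);
            runType("routing", ctx);
        } catch (RuntimeException e) {
            ctx.put("error", e);       // assumption: failures are handed to "error" filters
            runType("error", ctx);
        }
        runType("post", ctx);
        return ctx;
    }

    // Registers filters out of order and returns the order in which they actually ran.
    static List<String> demo() {
        FilterChainSketch chain = new FilterChainSketch();
        List<String> trace = new ArrayList<>();
        chain.register(filter("post", 1, c -> true, c -> trace.add("post")));
        chain.register(filter("routing", 1, c -> true, c -> trace.add("routing")));
        chain.register(filter("pre", 5, c -> true, c -> trace.add("pre-5")));
        chain.register(filter("pre", 1, c -> true, c -> trace.add("pre-1")));
        chain.handle(new HashMap<>());
        return trace;
    }
}
```

`FilterChainSketch.demo()` registers the filters in scrambled order; the returned trace lists them by type first and by execution order second.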

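Example Filter
File: DeviceDelayFilter.groovy

    import com.netflix.zuul.ZuulFilter
    import com.netflix.zuul.context.RequestContext

    // Adds an artificial delay to requests from one misbehaving device type.
    class DeviceDelayFilter extends ZuulFilter {

        static Random rand = new Random()

        @Override
        String filterType() {
            return 'pre'    // run before the request is routed to the origin
        }

        @Override
        int filterOrder() {
            return 5        // position among the other 'pre' filters
        }

        @Override
        boolean shouldFilter() {
            // Apply only when the deviceType request parameter is "BrokenDevice".
            // Groovy's == is null-safe, so a missing parameter yields false.
            return RequestContext.currentContext.request.getParameter("deviceType") == "BrokenDevice"
        }

        @Override
        Object run() {
            sleep(rand.nextInt(20000))  // sleep for a random duration between 0 and 20 seconds
            return null
        }
    }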

Filter deployment

Active/Active

Multiple Active Regions
[Diagram labels, repeated on each slide below: two sets of ZUUL, API, Cassandra, Services; later slides add DNS and GEO]

Multiple Active Regions - NM vs GE

Multiple Active Regions - Cassandra Replication across regions

DNS Misrouting

Geo lookup resolves IP in west

Zuul east routes to Zuul west

Response is from west
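
The misrouting slides describe a decision that can be sketched roughly as follows. This is a hypothetical illustration, not Zuul's routing code: `RegionRouterSketch`, the geo-to-region map and the region names are invented for the example, and the health check is an added assumption so that a region never forwards to one that is down.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of cross-region routing at the edge.
final class RegionRouterSketch {
    private final String localRegion;
    private final Map<String, String> homeRegionByGeo;  // result of a geo lookup, e.g. "CA" -> "us-west"
    private final Set<String> healthyRegions;

    RegionRouterSketch(String localRegion,
                       Map<String, String> homeRegionByGeo,
                       Set<String> healthyRegions) {
        this.localRegion = localRegion;
        this.homeRegionByGeo = homeRegionByGeo;
        this.healthyRegions = healthyRegions;
    }

    // Returns the region that should serve a request from the given client location.
    String regionFor(String clientGeo) {
        String home = homeRegionByGeo.getOrDefault(clientGeo, localRegion);
        boolean misrouted = !home.equals(localRegion);
        if (misrouted && healthyRegions.contains(home)) {
            return home;         // DNS sent the client here by mistake: forward to its home region
        }
        return localRegion;      // normal case, or the home region is unavailable
    }
}
```

With this shape, an east-coast router forwards a west-coast client to the west, while a west-coast router keeps serving east-coast clients locally when the east is marked unhealthy.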

Regional Failure

Catastrophe in US-East

East Coast is Down

Switch DNS to point to US-West

East traffic flows to West

Edge Server (API)

The Edge Service - Netflix’s API Tier

What’s wrong with REST for Netflix?

REST
● One size fits all
● One data format fits all
● REST tends to be atomic
● An average of 25 REST requests is needed to build up a page

Netflix’s Groovy Scripting Layer

Edge Scripting Tier
● Device teams write scripts for their device
  ○ control content, format, endpoints
● Code injected directly into Edge Service at runtime
  ○ Scripts are in production in about 30 seconds

Edge Server Architecture
[Diagram labels, repeated on each slide below: Endpoint Code (Groovy), Endpoint Controller, Endpoint Manager, RxJava Async Service Layer API, Hystrix (Fault tolerance), JVM, Netflix Services]

Pushing a Script
[Diagram adds: UI Engineer, /ps3/home script]

Controller pulls new script / compiles

Script Activated
[Diagram adds: UI Engineer, Activate]

Service Layer

Purpose of the Service Layer
● Interface to business logic (our API)
● Shield data consumers from service changes
● Combine and expose business data in a logical and consistent manner
● All Service Layer methods are async using RxJava
  ○ Hides concurrency and underlying implementation

RxJava

RxJava
● Why?
  ○ How do you expose an async service as an API?
  ○ Solution to compose async flows and sequences of data
  ○ Rich set of operators to filter and interact with data

How RxJava Helps
● Need to hide concurrency from script writers
  ○ Minimize the “bad things” consumers of our API on box can do
  ○ Hide the internal implementation
    ■ Change concurrency of any given call
    ■ Switch to non-blocking IO
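
The idea of hiding concurrency behind an async API can be illustrated with the JDK's `CompletableFuture`, used here as a stand-in for RxJava so that the example needs no external library. The class and method names (`AsyncServiceLayerSketch`, `userName`, `recommendations`, `homePage`) and the returned data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical service layer: every method returns a future, so callers cannot tell
// (and do not need to know) whether the work runs on another thread or is already done.
final class AsyncServiceLayerSketch {

    // Implemented with a pool thread today; could move to non-blocking IO later.
    static CompletableFuture<String> userName(int userId) {
        return CompletableFuture.supplyAsync(() -> "user-" + userId);
    }

    // Implemented synchronously, for example from a cache; same return type.
    static CompletableFuture<List<String>> recommendations(int userId) {
        return CompletableFuture.completedFuture(Arrays.asList("Title A", "Title B"));
    }

    // What an endpoint script might do: compose two calls without touching threads.
    static CompletableFuture<String> homePage(int userId) {
        return userName(userId).thenCombine(
                recommendations(userId),
                (name, titles) -> name + ": " + String.join(", ", titles));
    }
}
```

Calling `AsyncServiceLayerSketch.homePage(7).join()` blocks only at the very end, where the response is needed. Either service method could change its concurrency model without any change to `homePage`.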

Hystrix Service Resiliency

How Hystrix helps
● Latency and Fault Tolerance
  ○ Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery.
  ○ Thread and semaphore isolation with circuit breakers.
● Realtime Operations
  ○ Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet.
  ○ Be alerted, make decisions, effect change and see results in seconds.
● Concurrency
  ○ Parallel execution. Concurrency aware request caching. Automated batching through request collapsing.
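
The fail-fast and fallback behaviour can be sketched with a very small circuit breaker. This is not Hystrix and does not use its API: `CircuitBreakerSketch` is a hypothetical class, it trips on consecutive failures rather than on a rolling error rate, and it omits thread or semaphore isolation, metrics and request collapsing.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker with a fallback.
final class CircuitBreakerSketch<T> {
    private final int failureThreshold;   // consecutive failures before the circuit opens
    private final long openMillis;        // how long to fail fast before trying again
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    // Simplification: the lock is held during the call, which serializes callers.
    synchronized T execute(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();        // fail fast: do not touch the struggling service
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0;      // success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();        // graceful degradation instead of an error
        }
    }
}
```

Once the threshold is reached, further calls return the fallback without invoking the primary supplier until the open period has elapsed, which is what keeps one slow dependency from tying up the caller.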

Hystrix Dashboard Example

DELIVERY

Edge Delivery
● Continuous deployment
● Automated system integrity analysis
● Tools for facilitating delivery

Automated Deployment Pipeline

Edge Cluster Organization
[Diagram labels, repeated on each slide below: ELB; ZUUL, ZUUL-CANARY, ZUUL-DEBUG, ZUUL-SQUEEZE; MAIN ORIGIN, CANARY ORIGIN, DEBUG ORIGIN, SQUEEZE ORIGIN]

Most Requests to Main Origin

Some requests to Canary
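
One way to send most requests to the main origin and a small share to the canary is to bucket requests by a stable hash. The sketch below is an assumption about how such a split could be made, not a description of Netflix's routing filters; `CanaryRouterSketch`, the request id and the fraction are hypothetical.

```java
// Hypothetical sketch of splitting traffic between the main and canary origins.
final class CanaryRouterSketch {
    private static final int BUCKETS = 10_000;
    private final double canaryFraction;   // e.g. 0.01 sends about 1% of requests to the canary

    CanaryRouterSketch(double canaryFraction) {
        if (canaryFraction < 0.0 || canaryFraction > 1.0) {
            throw new IllegalArgumentException("canaryFraction must be between 0 and 1");
        }
        this.canaryFraction = canaryFraction;
    }

    // The same request id always maps to the same origin, which keeps comparisons stable.
    String originFor(String requestId) {
        int bucket = Math.floorMod(requestId.hashCode(), BUCKETS);
        return bucket < canaryFraction * BUCKETS ? "CANARY ORIGIN" : "MAIN ORIGIN";
    }
}
```

Raising the fraction gradually would widen the canary's share of traffic without redeploying either origin.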

Canary Analysis

Canary Analysis Detail

Response Validation

Fork response to Main and Canary

Validate response integrity

Targeted Debugging

Squeezing the Origin

Finding service Capacity

Scryer - Predictive auto-scaling
● Why?
  ○ Reactive doesn’t work in all cases
  ○ Reacting is sometimes too late
    ■ Sunday morning cartoons
  ○ Reactive overreacts
    ■ Super Bowl, World Cup, Outages
    ■ Fixed size scaling
  ○ All in all - more reliable and saves money

Daily Traffic Patterns

Scryer Predictions

How does Scryer work?
● Traffic shape analysis
  ○ Monday vs Monday
  ○ Sunday vs Sunday, etc
  ○ FFT based smoothing

Filtering out Noise

Ignoring outages

Accounting for regular spikey traffic

Iteratively apply FFT

Other Scryer Factors
● Traffic volume analysis
  ○ At least 4 weeks of data
  ○ Linear regression based on time of day
  ○ Correct the prediction based on today’s trend
● Instance factors
  ○ Instance startup time
  ○ Instance capacity (obtained by squeeze testing)
● Scale (up/down) actions scheduled based on prediction
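
The volume factors above can be turned into a small worked example. The code below is a sketch under stated assumptions and is not Scryer's algorithm: it fits a least-squares line through the traffic seen at the same time of day in previous weeks, optionally scales the result by how today is trending, and converts the prediction into an instance count. `ScalingPredictionSketch`, its method names and all numbers are hypothetical.

```java
// Hypothetical sketch of prediction-driven scaling.
final class ScalingPredictionSketch {

    // Fits a least-squares line through (0, y[0]), (1, y[1]), ... and
    // returns its value at x = y.length, i.e. the next week.
    static double predictNext(double[] y) {
        int n = y.length;
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (double v : y) {
            meanY += v / n;
        }
        double sxy = 0;
        double sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (i - meanX) * (y[i] - meanY);
            sxx += (i - meanX) * (i - meanX);
        }
        double slope = sxx == 0 ? 0 : sxy / sxx;   // a single data point has no trend
        return meanY + slope * (n - meanX);
    }

    // Scales the prediction by the ratio of observed to predicted traffic so far today.
    static double correctForToday(double prediction, double observedSoFar, double predictedSoFar) {
        return prediction * (observedSoFar / predictedSoFar);
    }

    // Capacity per instance would come from squeeze testing.
    static int instancesNeeded(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }
}
```

For four weeks of readings of 1000, 1100, 1200 and 1300 requests per second at a given time of day, the fitted line extrapolates to 1400 for the coming week; at an assumed 300 requests per second per instance that rounds up to 5 instances. Instance startup time would then determine how far ahead of the predicted load the scale-up has to be scheduled.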

The Future

Future - Large Projects on Edge
● Async, non-blocking servers
● Service layer redesign
● Internal Insights
● Global Insights

Edge Architecture Today
[Diagram labels: ELB (x3), Zuul (x3), API Service, Streaming Service, Website Service, Netflix Services]

Future Edge Architecture
[Diagram labels: ELB (x2), Zuul, API/Edge Service, Playback Services, Website, Netflix Services]

Global Insights
[Diagram labels: User Interface, Client Data, Zuul, API/Edge Service, Playback Services, Netflix Services, Event Stream, Insight Engine]

User Interface Designs

Netflix in the Cloud - 5 years later: Lessons learned

What Did We Learn?

Failure is Assured!

Building for Failure
● Code failure - Continuous delivery
● Service failure - fallbacks and redundancy
● Instances and Zone failure - redundancy
● Cloud infrastructure failure - Multiple active regions
● Human failure - Automation

Drawbacks of the cloud
● The cause of some failures is difficult to detect
  ○ Huge variability in instance performance that is almost impossible to explain
  ○ Network barriers
  ○ Multi tenancy
  ○ Firewalls
● Very limited access to information and limited ability to fix issues

Software focus: Cloud’s greatest strength
● Scale our business
● Automate processes
● Radically experiment
● Remain resilient
● Move quickly

Netflix Culture - Our secret sauce
● Freedom and responsibility
● Highly aligned teams
● Aversion to process
● Design for necessity
● Design for failure
● Engineering teams operating their services

Netflix OSS
● Zuul - Smart edge router
● RxJava - Functional reactive libraries
● Hystrix - SOA resiliency
● + a lot more!

For more info on Netflix Cloud Technology:
Read our Technology Blog: http://techblog.netflix.com/
Check out our Open Source Cloud Projects: http://netflix.github.io
