Edge Architecture: IEEE International Conference on Cloud Engineering


Published on March 12, 2014

Author: MikeyCohen1

Source: slideshare.net

Netflix’s Global Cloud Edge Architecture
Mikey Cohen, mikey@netflix.com
Edge Engineering Platform, Netflix

Over 44 million subscribers in over 40 countries

Netflix accounts for over 30% of peak internet traffic in North America

One billion hours per month (roughly 100,000 years)

Netflix supports over 1000 device types

Edge Services
● Front door to Netflix
● Edge Routing - Zuul
● API - Edge Server
● Playback services

How does Netflix Streaming work? (A simplified view)

[Diagram, repeated on each of the following slides. Labels: Your CE Device, Netflix Services in Amazon Cloud, Edge Services, CDN. Device labels: User Interface, Netflix Streaming Platform, DRM, encoding, CE integration.]

Device Under the Hood

User Interface loaded, data retrieved from Netflix Edge Service

User Interface Loaded

Movie Authorization (diagram label: Authorize)

Obtaining License (diagram label: License)

Movie starts streaming (diagram label: PlayData)

Periodic “bookmark” calls note place in movie (diagram label: bookmark)

Edge Services - What we are talking about today

Edge’s lofty mission
● High Availability
● Good performance
● Data broker between many services and devices in a global, high volume, rapidly innovating, highly dynamic service
● Clients and services are constantly changing

Edge stats
● Billions of incoming requests per day
  ○ Over 10X outgoing service calls per request
● About 10 device changes per day
● Daily service pushes
● Daily routing changes

Architecture Goals
● Infrastructure
  ○ Availability
  ○ Resiliency
  ○ Scalability
● Application
  ○ Platform diversity
  ○ Rapid innovation
  ○ A/B Testing
● Delivery
  ○ Automation
  ○ Insights

Netflix’s Global Cloud Architecture

High Level Regional Edge Architecture

[Diagram labels: ELB (shown three times), Zuul, Edge Service, Playback Service, Website Service, Netflix Services.]

Zuul

What is Zuul?
● Open source framework for dynamically reading, writing, and executing filters that act on incoming HTTP requests
● Dynamically compiled filters written in Groovy
  ○ Any JVM language supported
● Filters share state through a request scoped context

How we use Zuul
● Authentication
● Insights
● Stress Testing
● Canary Testing
● Dynamic Routing
● Service Migration
● Load Shedding
● Security
● Static Response handling
● Active/Active traffic management

Zuul Filter Characteristics
● Type
● Execution Order
● Criteria
● Action

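Zuul Filter Lifecycle

[Diagram: an HTTP Request passes through "pre" filters and then "routing" filter(s), which send the HTTP request to the Origin Server; the HTTP response comes back through "post" filters. "error" filters and "custom" filters are also shown.]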
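The lifecycle above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Zuul's implementation: `SketchFilter`, `FilterChainSketch`, and `process` are hypothetical names, the request-scoped context is modelled as a `Map`, and no real origin call is made. Filters run grouped by type, sorted by execution order within a type, and only when their criteria match; an exception in the "pre" or "routing" stage diverts to the "error" filters before the "post" filters run.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical, simplified model of a filter: type, execution order, criteria, action.
class SketchFilter {
    final String type;                             // "pre", "routing", "post", or "error"
    final int order;                               // lower values run first within a type
    final Predicate<Map<String, Object>> criteria; // should this filter run?
    final Consumer<Map<String, Object>> action;    // what the filter does

    SketchFilter(String type, int order,
                 Predicate<Map<String, Object>> criteria,
                 Consumer<Map<String, Object>> action) {
        this.type = type;
        this.order = order;
        this.criteria = criteria;
        this.action = action;
    }
}

class FilterChainSketch {
    private final List<SketchFilter> filters = new ArrayList<>();

    void add(SketchFilter f) {
        filters.add(f);
    }

    // Runs every filter of one type, in execution order, if its criteria match.
    private void runType(String type, Map<String, Object> ctx) {
        filters.stream()
               .filter(f -> f.type.equals(type))
               .sorted(Comparator.comparingInt((SketchFilter f) -> f.order))
               .filter(f -> f.criteria.test(ctx))
               .forEachOrdered(f -> f.action.accept(ctx));
    }

    // pre -> routing -> post; an exception diverts to the error filters,
    // and the post filters still run so a response can be written.
    void process(Map<String, Object> ctx) {
        try {
            runType("pre", ctx);
            runType("routing", ctx);
        } catch (RuntimeException e) {
            ctx.put("error", e.getMessage());
            runType("error", ctx);
        }
        runType("post", ctx);
    }
}
```

Because the context map is passed to every filter, a "pre" filter can leave a value that a "routing" or "post" filter reads later, which is the role the request-scoped context plays in the slides.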

Example Filter
File: DeviceDelayFilter.groovy

    import com.netflix.zuul.ZuulFilter
    import com.netflix.zuul.context.RequestContext

    class DeviceDelayFilter extends ZuulFilter {

        static Random rand = new Random()

        @Override
        String filterType() {
            return 'pre'  // run before the request is routed
        }

        @Override
        int filterOrder() {
            return 5  // position among the other "pre" filters
        }

        @Override
        boolean shouldFilter() {
            // Apply only to requests from the broken device type.
            // Comparing from the literal is null-safe when the parameter is absent.
            return "BrokenDevice".equals(
                RequestContext.getCurrentContext().getRequest().getParameter("deviceType"))
        }

        @Override
        Object run() {
            // Sleep for a random duration between 0 and 20 seconds.
            sleep(rand.nextInt(20000))
            return null
        }
    }

Filter deployment

Active/Active

Multiple Active Regions

[Diagram, repeated on the following slides. Labels for each of two regions: DNS, ZUUL, API, Cassandra, Services. A GEO label is added from the geo lookup slide onward.]

Multiple Active Regions - NM vs GE

Multiple Active Regions - Cassandra Replication across regions

DNS Misrouting

Geo lookup resolves IP in west

Zuul east routes to Zuul west

Response is from west
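The misrouting correction described in the last few slides can be sketched as a small routing decision. This is only an illustration: `RegionRouterSketch`, `targetRegion`, `mustForward`, the region names, and the `geoLookup` function are all assumptions, and the actual geo lookup and cross-region forwarding are not shown in the slides.

```java
import java.util.function.Function;

// Hypothetical sketch of the misrouting correction; not Netflix's implementation.
class RegionRouterSketch {
    private final String localRegion;                 // region this router runs in
    private final Function<String, String> geoLookup; // client IP -> home region (assumed lookup)

    RegionRouterSketch(String localRegion, Function<String, String> geoLookup) {
        this.localRegion = localRegion;
        this.geoLookup = geoLookup;
    }

    // Returns the region that should serve the request: the region the geo
    // lookup resolves, or the local region when the location is unknown.
    String targetRegion(String clientIp) {
        String home = geoLookup.apply(clientIp);
        if (home == null) {
            return localRegion; // unknown location: serve where the request landed
        }
        return home;
    }

    // True when DNS delivered the request to the wrong region and it
    // should be forwarded to the router in the client's home region.
    boolean mustForward(String clientIp) {
        return !targetRegion(clientIp).equals(localRegion);
    }
}
```

In the scenario from the slides, a router in the east that resolves a client to the west would report `mustForward` as true and hand the request to the west, so the response comes from the west.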

Regional Failure

Catastrophe in US-East

East Coast is Down

Switch DNS to point to US-West

East traffic flows to West

Edge Server (API)

The Edge Service - Netflix’s API Tier

[Diagram labels: ELB (shown three times), Zuul, Edge Service, Playback Service, Website Service, Netflix Services.]

What’s wrong with REST for Netflix?

REST
● One Size Fits All
● One Data Format Fits All
● REST tends to be atomic
● Average 25 REST requests to build up a page

Netflix’s Groovy Scripting Layer

Edge Scripting Tier
● Device teams write scripts for their device
  ○ control content, format, endpoints
● Code injected directly into Edge Service at runtime
  ○ Scripts are in production in about 30 seconds

Edge Server Architecture

[Diagram, repeated on the following slides. Labels: Endpoint Code (Groovy), Endpoint Controller, Endpoint Manager, RxJava Async Service Layer API, Hystrix (Fault tolerance), JVM, Netflix Services.]

Pushing a Script (diagram labels: UI Engineer, /ps3/home script)

Controller pulls new script / compiles

Script Activated (diagram labels: UI Engineer, Activate)
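The push, compile, and activate sequence above amounts to swapping request handlers while the server keeps running. The sketch below illustrates that idea only: `EndpointRegistrySketch` and its methods are hypothetical names, and plain Java lambdas stand in for the compiled Groovy scripts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: endpoint handlers that can be replaced while the server runs.
class EndpointRegistrySketch {
    // path -> currently active handler (request text in, response text out)
    private final Map<String, Function<String, String>> active = new ConcurrentHashMap<>();
    // path -> handler that has been pushed but not yet activated
    private final Map<String, Function<String, String>> staged = new ConcurrentHashMap<>();

    // "Pushing a script": store the new handler without serving traffic from it yet.
    void push(String path, Function<String, String> handler) {
        staged.put(path, handler);
    }

    // "Script activated": make the staged handler the one used for new requests.
    boolean activate(String path) {
        Function<String, String> handler = staged.remove(path);
        if (handler == null) {
            return false; // nothing staged for this path
        }
        active.put(path, handler);
        return true;
    }

    // Serves a request with whichever handler is active for the path.
    String handle(String path, String request) {
        Function<String, String> handler = active.get(path);
        return handler == null ? "404" : handler.apply(request);
    }
}
```

Keeping staged and active handlers separate means a pushed script serves no traffic until it is explicitly activated, and the previous version keeps answering requests in the meantime.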

Service Layer

Purpose of the Service Layer
● Interface to business logic (our API)
● Shield data consumers from service changes
● Combine and expose business data in a logical and consistent manner
● All Service Layer methods are async using RxJava
  ○ Hides concurrency and underlying implementation

RxJava

RxJava
● Why?
  ○ How do you expose an async service as an API?
  ○ Solution to compose async flows and sequences of data
  ○ Rich set of operators to filter and interact with data

How RxJava Helps
● Need to hide concurrency from script writers
  ○ Minimize the “bad things” that on-box consumers of our API can do
  ○ Hide the internal implementation
    ■ Change concurrency of any given call
    ■ Switch to non-blocking IO
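The idea of an async service layer that hides concurrency from script writers can be sketched with the JDK alone. Note the substitution: `CompletableFuture` stands in here for RxJava's `Observable`, so this is not RxJava code, and `ServiceLayerSketch`, `getUserName`, `getTitles`, and `homeScreen` are hypothetical names with made-up return values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical async service layer. CompletableFuture stands in for RxJava's
// Observable so the example needs only the JDK.
class ServiceLayerSketch {
    // Each call returns immediately; the caller never sees threads or blocking IO.
    CompletableFuture<String> getUserName(int userId) {
        return CompletableFuture.supplyAsync(() -> "user-" + userId);
    }

    CompletableFuture<List<String>> getTitles(int userId) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList("Title A", "Title B"));
    }

    // What an endpoint script might do: combine two async results into one
    // device-specific response without managing concurrency itself.
    CompletableFuture<String> homeScreen(int userId) {
        return getUserName(userId).thenCombine(
                getTitles(userId),
                (name, titles) -> name + ": " + String.join(", ", titles));
    }
}
```

Because callers only compose the returned futures, the service layer could later change how each call runs (a different thread pool, or non-blocking IO) without any change to the composing code.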

Hystrix Service Resiliency

How Hystrix helps
● Latency and Fault Tolerance
  ○ Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery.
  ○ Thread and semaphore isolation with circuit breakers.
● Realtime Operations
  ○ Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet.
  ○ Be alerted, make decisions, effect change and see results in seconds.
● Concurrency
  ○ Parallel execution. Concurrency aware request caching. Automated batching through request collapsing.
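The fallback and fail-fast behaviour listed above can be illustrated with a very small circuit breaker. This sketch is not Hystrix's API: `CircuitBreakerSketch` is a hypothetical, single-threaded class that opens after a fixed number of consecutive failures, and it leaves out thread isolation, rolling metrics, and automatic recovery.

```java
import java.util.function.Supplier;

// Hypothetical, single-threaded circuit breaker sketch. It is not Hystrix's API
// and omits thread isolation, metrics windows, and half-open probing.
class CircuitBreakerSketch {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private int consecutiveFailures = 0;

    CircuitBreakerSketch(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    // Runs the command, or returns the fallback if the command fails
    // or the circuit is already open (fail fast, no call to the dependency).
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();
        }
        try {
            T result = command.get();
            consecutiveFailures = 0; // a success ends the failure streak
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();
        }
    }

    // Manual reset, standing in for the recovery logic a real breaker would have.
    void reset() {
        consecutiveFailures = 0;
    }
}
```

Once the circuit is open, callers get the fallback immediately and the failing dependency receives no further calls, which is how a breaker keeps one slow service from cascading into its callers.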

Hystrix Dashboard Example

DELIVERY

Edge Delivery
● Continuous deployment
● Automated system integrity analysis
● Tools for facilitating delivery

Automated Deployment Pipeline

Edge Cluster Organization

[Diagram, repeated on the following slides. Labels: ELB; ZUUL, ZUUL-CANARY, ZUUL-DEBUG, ZUUL-SQUEEZE; MAIN ORIGIN, CANARY ORIGIN, DEBUG ORIGIN, SQUEEZE ORIGIN.]

Most Requests to Main Origin

Some requests to Canary
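Sending most requests to the main origin and a small share to the canary can be sketched as a weighted routing decision. The slides do not say how the split is made, so everything here is an assumption: `CanaryRouterSketch` is a hypothetical class, the split is by a hash of a request id, and the percentage is an example value. The origin names follow the diagram labels.

```java
// Hypothetical sketch of weighted routing between clusters; the cluster names
// follow the diagram labels, and the percentage is an assumed example value.
class CanaryRouterSketch {
    private final int canaryPercent; // share of requests sent to the canary, 0..100

    CanaryRouterSketch(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be in 0..100");
        }
        this.canaryPercent = canaryPercent;
    }

    // Maps a request id to a stable bucket in 0..99 so that the same id
    // always lands on the same origin.
    String chooseOrigin(String requestId) {
        int bucket = Math.floorMod(requestId.hashCode(), 100);
        return bucket < canaryPercent ? "CANARY ORIGIN" : "MAIN ORIGIN";
    }
}
```

Hashing rather than picking at random keeps the decision repeatable for a given id, which makes a canary's behaviour easier to compare against the main origin.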

Canary Analysis

Canary Analysis Detail

Response Validation

Fork response to Main and Canary

Validate response integrity

Targeted Debugging

Squeezing the Origin

Finding service Capacity

Scryer - Predictive auto-scaling
● Why?
  ○ Reactive doesn’t work in all cases
  ○ Reacting is sometimes too late
    ■ Sunday morning cartoons
  ○ Reactive overreacts
    ■ Super Bowl, World Cup, Outages
    ■ Fixed size scaling
  ○ All in all - more reliable and saves money

Daily Traffic Patterns

Scryer Predictions

How does Scryer work?
● Traffic shape analysis
  ○ Monday vs Monday
  ○ Sunday vs Sunday, etc
  ○ FFT based smoothing

Filtering out Noise

Ignoring outages

Accounting for regular spikey traffic

Iteratively apply FFT
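The slides name FFT based smoothing but do not show the procedure, so the following is only a generic sketch of frequency-domain smoothing, not Scryer's algorithm. `FourierSmoothingSketch` and `lowPass` are hypothetical names, and a naive O(n^2) discrete Fourier transform is used in place of an FFT to keep the code short; it computes the same transform, only more slowly.

```java
// Hypothetical sketch of frequency-domain smoothing. A naive O(n^2) discrete
// Fourier transform stands in for an FFT; the output is the same, only slower.
class FourierSmoothingSketch {
    // Keeps the mean and the `keep` lowest frequencies of `signal` and drops the rest.
    static double[] lowPass(double[] signal, int keep) {
        int n = signal.length;
        double[] re = new double[n];
        double[] im = new double[n];
        // Forward transform: X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re[k] += signal[t] * Math.cos(angle);
                im[k] -= signal[t] * Math.sin(angle);
            }
        }
        // Zero every bin whose frequency index (counted from either end) exceeds `keep`.
        for (int k = 0; k < n; k++) {
            int freq = Math.min(k, n - k);
            if (freq > keep) {
                re[k] = 0;
                im[k] = 0;
            }
        }
        // Inverse transform. The input was real and the kept bins are symmetric,
        // so only the real part of the result is needed.
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            double sum = 0;
            for (int k = 0; k < n; k++) {
                double angle = 2 * Math.PI * k * t / n;
                sum += re[k] * Math.cos(angle) - im[k] * Math.sin(angle);
            }
            out[t] = sum / n;
        }
        return out;
    }
}
```

Applied to a day of traffic samples, a small `keep` value would retain the broad daily shape while discarding short spikes and noise; how many frequencies to keep is a tuning choice that the slides do not specify.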

Other Scryer Factors
● Traffic volume analysis
  ○ At least 4 weeks of data
  ○ Linear regression based on time of day
  ○ Correct the prediction based on today’s trend.
● Instance factors
  ○ Instance startup time
  ○ Instance capacity (obtained by squeeze testing)
● Scale (up/down) actions scheduled based on prediction
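The factors above can be combined into a rough worked example: fit a line through past traffic, turn the predicted load into an instance count, and schedule the launch early enough to cover startup time. This is a sketch under assumptions, not Scryer's implementation: `ScalingPlanSketch` and its methods are hypothetical, and the regression is assumed to run over observations of the same time-of-day slot from previous weeks.

```java
// Hypothetical sketch of turning a traffic prediction into a scaling action.
// All names are assumptions made for illustration.
class ScalingPlanSketch {
    // Ordinary least-squares fit of y = a + b*x; returns {a, b}.
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new double[] {a, b};
    }

    // Instances needed to serve `predictedRps`, given per-instance capacity
    // (for example, as measured by squeeze testing).
    static int instancesNeeded(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }

    // Launch early enough that instances are ready when the traffic arrives.
    static long launchTimeSeconds(long neededAtSeconds, long startupSeconds) {
        return neededAtSeconds - startupSeconds;
    }
}
```

For instance, a predicted 1000 requests per second with an assumed capacity of 300 per instance gives four instances, and with a ten-minute startup time they would be launched ten minutes before the predicted load.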

The Future

Future - Large Projects on Edge
● Async, non-blocking servers
● Service layer redesign
● Internal Insights
● Global Insights

Edge Architecture Today

[Diagram labels: ELB and Zuul (each shown three times), API Service, Streaming Service, Website Service, Netflix Services.]

Future Edge Architecture

[Diagram labels: ELB (shown twice), Zuul, API/Edge Service, Playback Services, Website, Netflix Services.]

Global Insights

[Diagram labels: User Interface, Zuul, API/Edge Service, Playback Services, Netflix Services, Client Data, Event Stream, Insight Engine.]

User Interface Designs

Netflix in the Cloud - 5 years later: Lessons learned

What Did We Learn?

Failure is Assured!

Building for Failure
● Code failure - Continuous delivery
● Service failure - fallbacks and redundancy
● Instances and Zone failure - redundancy
● Cloud infrastructure failure - Multiple active regions
● Human failure - Automation

Drawbacks of the cloud
● The cause of some failures is difficult to detect
  ○ Huge variability in instance performance that is almost impossible to explain
  ○ Network barriers
  ○ Multi tenancy
  ○ Firewalls
● Very limited access to information and limited ability to fix issues

Software focus: Cloud’s greatest strength
● Scale our business
● Automate processes
● Radically experiment
● Remain resilient
● Move quickly

Netflix Culture - Our secret sauce
● Freedom and responsibility
● Highly aligned teams
● Aversion to process
● Design for necessity
● Design for failure
● Engineering teams operating their services

Netflix OSS
● Zuul - Smart edge router
● RxJava - Functional reactive libraries
● Hystrix - SOA resiliency
● + a lot more!

For more info on Netflix Cloud Technology:
Read our Technology Blog: http://techblog.netflix.com/
Check out our Open Source Cloud Projects: http://netflix.github.io
