Maintaining the Front Door to Netflix: The Netflix API

Published on March 12, 2014

Author: danieljacobson

Source: slideshare.net

Description

This presentation was given to the engineering organization at Zendesk. In this presentation, I talk about the challenges that the Netflix API faces in supporting 1000+ different device types, millions of users, and billions of transactions. The topics include resiliency, scale, API design, failure injection, continuous delivery, and more.

Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson

There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation.

Global Streaming Video for TV Shows and Movies

More than 44 Million Subscribers More than 40 Countries

Netflix Accounts for ~33% of Peak Internet Traffic in North America Netflix subscribers are watching more than 1 billion hours a month

Team Focus: Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming

Key Responsibilities
• Broker data between services and UIs
• Maintain a resilient front-door
• Scale the system vertically and horizontally
• Maintain high velocity

But Before Streaming…

Monolithic Application In Netflix Data Centers

The bigger the ship… the slower it turns

Distributed Architecture

1000+ Device Types

Dozens of Dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine

API: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine

Dependency Relationships

2,000,000,000 Requests Per Day to the Netflix API

30 Distinct Dependent Services for the Netflix API

~500 Dependency jars Slurped into the Netflix API

14,000,000,000 Netflix API Calls Per Day to those Dependent Services

0 Dependent Services with 100% SLA

99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month

99.9%^30 = 97%; 3% of 2B = 60M failures per day; 20+ Hours of Downtime Per Month
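The arithmetic on these two slides can be reproduced with a short sketch. It assumes the 30 dependencies fail independently and that a month has 30 days; the function and constant names are illustrative and do not come from the presentation.

```python
REQUESTS_PER_DAY = 2_000_000_000   # from the slides: 2B requests per day
HOURS_PER_MONTH = 30 * 24          # assumption: a 30-day month


def compound_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n independent services to succeed."""
    return per_service ** n_services


def summarize(per_service: float, n_services: int = 30) -> dict:
    """Turn a per-service availability into the figures quoted on the slides."""
    availability = compound_availability(per_service, n_services)
    failure_rate = 1.0 - availability
    return {
        "availability": availability,
        "failures_per_day": failure_rate * REQUESTS_PER_DAY,
        "downtime_hours_per_month": failure_rate * HOURS_PER_MONTH,
    }


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        print(sla, summarize(sla))
```

Dropping each dependency from 99.99% to 99.9% costs roughly ten times as many failed requests, which is why no single dependency can be trusted to protect the API.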

API: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine

Circuit Breaker Dashboard

Call Volume and Health / Last 10 Seconds

Call Volume / Last 2 Minutes

Successful Requests

Successful, But Slower Than Expected

Short-Circuited Requests, Delivering Fallbacks

Timeouts, Delivering Fallbacks

Thread Pool & Task Queue Full, Delivering Fallbacks

Exceptions, Delivering Fallbacks

Error Rate = (short-circuited + timeouts + rejected + exceptions) / (successful + short-circuited + timeouts + rejected + exceptions)
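As a rough illustration of how the dashboard's error rate relates to the counters listed on the preceding slides, here is a minimal sketch. The class and field names are assumptions made for this example, not the dashboard's actual data model.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request outcomes observed in one rolling window (hypothetical structure)."""
    successes: int
    short_circuited: int
    timeouts: int
    rejected: int      # thread pool and task queue full
    exceptions: int


def error_rate(counts: WindowCounts) -> float:
    """Share of requests in the window that ended in a fallback rather than success."""
    errors = (counts.short_circuited + counts.timeouts
              + counts.rejected + counts.exceptions)
    total = counts.successes + errors
    return errors / total if total else 0.0
```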

Status of Fallback Circuit

Requests per Second, Over Last 10 Seconds

SLA Information

API: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine

API: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine; Fallback
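The fallback behaviour shown in these diagrams can be sketched as a small circuit breaker. This is a simplified illustration and not Netflix's implementation: it trips on a count of consecutive failures rather than the windowed error rate shown on the dashboard, and every name in it is hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after repeated failures, then serves fallbacks until a reset period passes."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock                      # injectable so tests can control time
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()                   # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip (or re-trip) the circuit
            return fallback()
        self.consecutive_failures = 0           # a success closes the circuit again
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` stands for any function that may raise or time out. The failing dependency then degrades one part of the response instead of the whole request.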

Scaling the Distributed System

AWS Cloud

Autoscaling

Amazon Auto Scaling Limitations
• Hard to fit policies to variable traffic patterns (weekday vs weekend)
• Limited control over capacity adjustments (absolute value or %)

The Impact of AAS Limitations
• Traffic drop can lead to scale downs during outage
• Performance degradation between new instance launch and taking traffic
• Excess capacity at peak and trough

Scryer : Predictive Auto Scaling Not yet…

Typical Traffic Patterns Over Five Days

Predicted RPS Compared to Actual RPS

Scaling Plan for Predicted Workload

What is Scryer Doing?
• Evaluating needs based on historical data
  – Week over week, month over month metrics
• Adjusting instance minimums based on algorithms
• Relying on Amazon Auto Scaling for unpredicted events
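The idea behind the bullets above can be sketched in a few lines. This is not Scryer's algorithm, which the slides do not describe. It assumes the simplest possible predictor (the average of the same time slot in previous weeks) and a hypothetical per-instance capacity and headroom.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def predict_rps(same_slot_history: Sequence[float]) -> float:
    """Predict requests per second from the same time slot in earlier weeks."""
    return mean(same_slot_history)


def instance_minimum(predicted_rps: float, rps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Smallest instance count that serves the prediction plus some headroom."""
    needed = predicted_rps * (1.0 + headroom) / rps_per_instance
    return max(1, math.ceil(needed))


def scaling_plan(history_by_slot: Dict[str, Sequence[float]],
                 rps_per_instance: float) -> Dict[str, int]:
    """Map each time slot to the instance minimum to set ahead of time."""
    return {slot: instance_minimum(predict_rps(history), rps_per_instance)
            for slot, history in history_by_slot.items()}
```

The plan only sets a floor. Reactive autoscaling stays in place above it, so traffic the prediction missed can still add capacity.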

Results

Results : Load Average Reactive Predictive

Results : Response Latencies Reactive Predictive

Results : Outage Recovery

Results : AWS Costs

Scaling Globally

More than 44 Million Subscribers More than 40 Countries

Zuul Gatekeeper for the Netflix Streaming Application

Zuul *
• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication
* Most closely resembles an API proxy

Isthmus

All of these approaches are designed to prevent failures…

But sometimes the best way to prevent failures is to force them!

Chaos Monkey: I randomly terminate instances in production to identify dormant failures.

Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.

Chaos Kong: I simulate an outage in an AWS region.

Conformity Monkey: I find instances that don’t adhere to best practices.

Security Monkey: I extend Conformity Monkey to find security violations.

Doctor Monkey: I detect unhealthy instances and remove them from service.

Janitor Monkey: I clean up the clutter and waste that runs in the cloud.

Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
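To make the failure-injection idea concrete, here is a minimal sketch in the spirit of Latency Monkey. It wraps a single function call inside one process, whereas the real tools act on live services. The wrapper, its parameters, and the exception type are all invented for this example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real response when an error is injected."""


def with_faults(fn: Callable[..., Any], delay_s: float, delay_prob: float,
                error_prob: float, rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Return a version of fn that sometimes fails or responds slowly."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < error_prob:
            raise InjectedFault("injected error")
        if rng.random() < delay_prob:
            sleep(delay_s)                 # artificial latency before the real call
        return fn(*args, **kwargs)

    return wrapped
```

Wrapping a dependency call this way in a test environment shows whether the caller's timeouts and fallbacks actually engage when the dependency misbehaves.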

Deployments in the Cloud

Dependency Relationships

Testing Philosophy: Act Fast, React Fast

That Doesn’t Mean We Don’t Test

Automated Delivery Pipeline

Cloud-Based Deployment Techniques

Current Code In Production API Requests from the Internet

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet

Canary Analysis Automation

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!

Current Code In Production API Requests from the Internet

Current Code In Production API Requests from the Internet Perfect!
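The canary sequence above comes down to comparing the canary instance against the current production fleet before promoting new code. The sketch below assumes error rate is the only signal and uses made-up thresholds. The slides do not say which metrics or thresholds the automated canary analysis uses.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Metrics, canary: Metrics,
                   max_ratio: float = 1.5, min_requests: int = 1000,
                   error_floor: float = 0.001) -> str:
    """Return 'pass', 'fail', or 'insufficient-data' for a canary run."""
    if canary.requests < min_requests:
        return "insufficient-data"       # too little traffic to judge
    # The floor keeps a near-zero baseline from failing every canary.
    allowed = max(baseline.error_rate * max_ratio, error_floor)
    return "pass" if canary.error_rate <= allowed else "fail"
```

A "fail" verdict corresponds to the "Error!" slide, where the canary is pulled and production is untouched. A "pass" corresponds to "Perfect!".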

Stress Test with Zuul

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

Current Code In Production API Requests from the Internet Perfect!

Stress Test with Zuul

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production

API Requests from the Internet New Code Getting Prepared for Production

Brokering Data to 1,000+ Device Types

Screen Real Estate

Controller

Technical Capabilities

One-Size-Fits-All API Request Request Request

Courtesy of South Florida Classical Review

Resource-Based API vs. Experience-Based API

Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people

REST API: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border

OSFA API: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border; SERVER CODE; CLIENT CODE

OSFA API: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border; DATA GATHERING, FORMATTING, AND DELIVERY; USER INTERFACE RENDERING

Experience-Based Requests • /ps3/homescreen

JAVA API with Groovy Layer: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border

JAVA API: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; SERVER CODE; CLIENT CODE; CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER); Network Border

JAVA API: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; DATA GATHERING; DATA FORMATTING AND DELIVERY; USER INTERFACE RENDERING; Network Border
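The difference between the two API styles can be illustrated with a toy model. In the slides the device-specific adapters are written by client teams and uploaded to a Groovy layer on the server. Here a plain Python function stands in for one, and the backend functions, the `Api` class, and the response shape are all hypothetical.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for backend services behind the API.
def recommendations(user_id: str) -> List[str]:
    return ["title-1", "title-2"]


def instant_queue(user_id: str) -> List[str]:
    return ["title-3"]


def ratings(user_id: str) -> Dict[str, int]:
    return {"title-1": 4}


def ps3_homescreen(user_id: str) -> dict:
    """Device-specific adapter: gathers and formats data on the server side."""
    return {
        "rows": [
            {"name": "Recommended", "titles": recommendations(user_id)},
            {"name": "Instant Queue", "titles": instant_queue(user_id)},
        ],
        "ratings": ratings(user_id),
    }


class Api:
    """Counts round trips across the network border for each style of request."""

    def __init__(self) -> None:
        self.round_trips = 0
        self.routes: Dict[str, Callable[[str], object]] = {
            "/users/<id>/recommendations": recommendations,
            "/users/<id>/queues/instant": instant_queue,
            "/users/<id>/ratings/title": ratings,
            "/ps3/homescreen": ps3_homescreen,
        }

    def get(self, path: str, user_id: str) -> object:
        self.round_trips += 1
        return self.routes[path](user_id)
```

With the resource-based routes a device needs three requests to build its home screen and must assemble the result itself. The experience-based route returns a screen-shaped payload in one round trip, and the fan-out to services happens on the server side of the network border.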

https://www.github.com/Netflix
