Hadoop World - Oct 2009

Published on October 11, 2009

Author: dgottfrid

Source: slideshare.net

Description

A review of what nytimes.com has been doing with Hadoop, from the simple to the less simple.

Cheap Parlor Tricks, Counting, and Clustering

Derek Gottfrid

The New York Times

October 2009

Evolution of Hadoop @ NYTimes.com

Early Days - 2007

Solution looking for a problem

Solution

Wouldn’t it be cool to use lots of EC2 instances

(it’s cheap; nobody will notice)

Wouldn’t it be cool to use Hadoop

(MapReduce Google style is awesome)

Found a Problem

Freeing up historical archives of NYTimes.com 1851-1922

Problem Bits

Articles are served as PDFs

Really need PDFs from 1851-1981

PDFs are dynamically generated

Free = more traffic

Real deadline

Background

What goes into making a PDF of a NYTimes.com article?

Each article is made up of many different pieces - multiple columns, different sized headings, multiple pages, photos.

Simple Answer

Pre-generate all 11 million PDFs and serve them statically.

Solution

Copy all the source data to S3

Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs

Store the output PDFs in S3

Serve the PDFs out of S3 w/ a signed query string
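The slides do not show how the query string was signed. As a rough illustration only, the sketch below follows the legacy HMAC-SHA1 query string authentication that S3 offered at the time (since superseded by Signature Version 4). The bucket name, object key, and credentials are made-up placeholders, and `sign_s3_url` is a hypothetical helper, not code from the talk.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_s3_url(bucket, key, access_key, secret_key, expires):
    """Build a time-limited GET URL using legacy S3 query string auth.

    `expires` is a Unix timestamp (seconds) after which the link stops working.
    """
    # String to sign for a plain GET: verb, empty Content-MD5, empty
    # Content-Type, expiry time, then the resource path.
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    digest = hmac.new(
        secret_key.encode(), string_to_sign.encode(), hashlib.sha1
    ).digest()
    signature = base64.b64encode(digest).decode()
    return (
        f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
        f"?AWSAccessKeyId={quote(access_key, safe='')}"
        f"&Expires={expires}"
        f"&Signature={quote(signature, safe='')}"
    )


# Placeholder values, chosen only for illustration.
url = sign_s3_url(
    "example-pdf-bucket",
    "1851/09/18/article-0001.pdf",
    "EXAMPLE_ACCESS_KEY",
    "example-secret",
    1255305600,
)
```

Because the signature covers the expiry time and the object path, a link cannot be extended or pointed at a different PDF without the secret key. Current code would normally let an AWS SDK generate presigned URLs instead.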

A Few Details

Limited HDFS - everything loaded in and out of S3

Reduce = 0 - only used for some stats and error reporting
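One way to picture a job like this is the plain-Python sketch below. It reads the slide as meaning that the real work happens in the map step, with only counts of successes and failures collected afterwards; that reading is an assumption. The sketch does not use Hadoop, and `render_pdf`, `run_map_only`, and the record layout are hypothetical stand-ins.

```python
from collections import Counter


def render_pdf(article):
    """Hypothetical stand-in for the real PDF rendering step."""
    if not article.get("pieces"):
        raise ValueError("article has no source pieces")
    return f"PDF({article['id']}: {len(article['pieces'])} pieces)".encode()


def run_map_only(articles, store):
    """Process each article independently; there is no shuffle or reduce."""
    counters = Counter()
    for article in articles:
        try:
            # Each output is written straight to the store by the map step.
            store[f"{article['id']}.pdf"] = render_pdf(article)
            counters["pdfs_written"] += 1
        except ValueError:
            # Failures are only counted, mirroring "stats and error reporting".
            counters["errors"] += 1
    return counters


output_store = {}  # stands in for the S3 output location
stats = run_map_only(
    [
        {"id": "a1", "pieces": ["col1.tif", "col2.tif"]},
        {"id": "a2", "pieces": []},
    ],
    output_store,
)
```

Since every article is handled on its own, the work spreads across machines with no coordination beyond tallying the counters.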

Breakdown

4.3 TB of source data into S3

11M PDFs - 1.5 TB output

$240 for EC2 - 24hrs x 100 machines
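The figures above imply a few per-unit numbers that the slide does not state. The short calculation below derives them, assuming decimal units (1 TB = 10^12 bytes) and that all 100 machines ran for the full 24 hours.

```python
# Figures taken from the slide.
machines = 100
hours = 24
ec2_cost_usd = 240
pdf_count = 11_000_000
output_tb = 1.5

# Derived values (not stated in the talk).
instance_hours = machines * hours                       # 2,400 instance-hours
cost_per_instance_hour = ec2_cost_usd / instance_hours  # $0.10 per instance-hour
cost_per_pdf = ec2_cost_usd / pdf_count                 # roughly $0.00002 per PDF
avg_pdf_kb = output_tb * 1e12 / pdf_count / 1e3         # roughly 136 KB per PDF
pdfs_per_instance_hour = pdf_count / instance_hours     # roughly 4,583 PDFs
```

These cover EC2 compute only; the slide gives no figures for S3 storage or transfer.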

TimesMachine http://timesmachine.nytimes.com

Currently - 2009

All that darn data - Web Analytics

Data

Registration / Demographic

Articles 1851 - today

Usage Data / Web Logs

Counting

Classic cookie tracking - let’s add it up

Total page views (PV)

Total unique users

PV per user
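The three counts above can be sketched in a few lines of plain Python. The talk describes doing this as MapReduce jobs in Java over web logs; the version below is only an in-memory illustration, and the `(cookie_id, url)` record layout and the `count_usage` helper are assumptions.

```python
from collections import Counter


def count_usage(log_records):
    """Summarize page views from (cookie_id, url) pairs, one pair per view."""
    # Conceptually the map step emits (cookie_id, 1) and the reduce step sums.
    views_per_user = Counter(cookie_id for cookie_id, _url in log_records)
    total_pv = sum(views_per_user.values())
    unique_users = len(views_per_user)
    pv_per_user = total_pv / unique_users if unique_users else 0.0
    return total_pv, unique_users, pv_per_user


# Made-up sample log.
sample_log = [
    ("cookie-a", "/2009/07/01/world/story.html"),
    ("cookie-b", "/2009/07/01/sports/story.html"),
    ("cookie-a", "/2009/07/02/business/story.html"),
    ("cookie-a", "/2009/07/02/world/story.html"),
]
total_pv, unique_users, pv_per_user = count_usage(sample_log)
```

Counting by cookie treats each cookie as one user, so the same person on two browsers is counted twice.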

A Few Details

Using EC2 - 20 Machines

Hadoop 0.20.0

12+TB of data

Straight MR in Java

Usage Data - July 2009

???M Page Views

??M Unique Users

Merging Data

Usage data combined with demographic data.
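A join like this is commonly done by grouping both datasets on a shared key. The sketch below shows that idea in plain Python, assuming page views and registration records share a user id; the field names and the `join_usage_with_demographics` helper are hypothetical, and the talk does not say which join strategy was used.

```python
from collections import Counter, defaultdict


def join_usage_with_demographics(page_views, registrations):
    """Inner-join (user_id, url) views with (user_id, profile) records."""
    # Group both inputs by user id, as a reduce-side join would.
    grouped = defaultdict(lambda: {"profile": None, "views": []})
    for user_id, profile in registrations:
        grouped[user_id]["profile"] = profile
    for user_id, url in page_views:
        grouped[user_id]["views"].append(url)

    joined = []
    for user_id, group in grouped.items():
        if group["profile"] is None:
            continue  # views with no matching registration are dropped
        for url in group["views"]:
            joined.append((user_id, group["profile"]["age_group"], url))
    return joined


# Made-up sample data.
registrations = [("u1", {"age_group": "18-24"}), ("u2", {"age_group": "35-44"})]
page_views = [("u1", "/a"), ("u2", "/b"), ("u1", "/c"), ("u3", "/a")]

joined = join_usage_with_demographics(page_views, registrations)
views_by_age_group = Counter(age_group for _user, age_group, _url in joined)
```

Once the two sources are joined, a breakdown by age group is just another count over the joined rows.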

Twitter Click Backs By Age Group July 2009

Merging Data

Usage data combined with article metadata.

Usage Data combined with Article Data July 2009 40 Articles

Products

Coming soon...

Clustering

Moving beyond simple counting and joining

Join usage data, demographic information, and article meta data

Apply simple k-means clustering
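For readers unfamiliar with k-means, the sketch below shows the basic loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is a minimal in-memory version, not the Hadoop job from the talk, and the two features per user (page views in two sections) are an invented example.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids.

    Returns the final centroids and the clusters from the last assignment step.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical features per user: (sports page views, business page views).
users = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(users, centroids=[(0, 0), (10, 10)])
```

The result depends on the starting centroids, so real runs usually try several initializations.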

Conclusion

Large scale computing is transformative for NYTimes.com.

Questions? [email_address] @derekg http://open.nytimes.com/
