2014 feb 5_what_ishadoop_mda

Technology

Published on February 6, 2014

Author: adammuise

Source: slideshare.net

Description

"What Is Hadoop", updated to include YARN, Tez, Spark, and Storm

Adam Muise – Hortonworks
WELCOME TO HADOOP

Who am I?

Why are we here?

Data

“Big Data” is the marketing term of the decade

What lurks behind the hype is the democratization of Data.

You need to deal with Data.

You’re probably not as good at that as you think.

Put it away, delete it, tweet it, compress it, shred it, wikileak-it, put it in a database, put it in SAN/NAS, put it in the cloud, hide it in tape…

Let’s talk challenges…

Volume
[Four slides: the word "Volume" repeated more and more times until it fills the slide]

Storage, Management, Processing all become challenges with Data at Volume

Traditional technologies adopt a divide, drop, and conquer approach

The solution?
[Diagram: Data divided among an EDW, Another EDW, Yet Another EDW, an Analytical DB, and an OLTP system]

Ummm…you dropped something
[Diagram: the same five systems plus a large amount of additional Data]

Analyzing the data usually raises more interesting questions…

…which leads to more data

Wait, you’ve seen this before.
[Diagram: Data, a Sausage Factory, and more Data]

Data begets Data.

What keeps us from Data?

“Prices, Stupid passwords, and Boring Statistics.”
- Hans Rosling
http://www.youtube.com/watch?v=hVimVzgtD6w

Your data silos are lonely places.
[Diagram: four separate silos of Data: EDW, Accounts, Customers, Web Properties]

… Data likes to be together.
[Diagram: the EDW, Accounts, Customers, and Web Properties Data brought together]

Data likes to socialize too.
[Diagram: the EDW, Accounts, Customers, and Web Properties Data joined by CDR, Machine Data, Facebook, Twitter, and Weather Data]

New types of data don’t quite fit into your pristine view of the world.
[Diagram: Logs and Machine Data next to "My Little Data Empire", with question marks]

To resolve this, some people take hints from Lord Of The Rings...

…and create One-Schema-To-Rule-Them-All…
[Diagram: an EDW holding Data under a single Schema]

…but that has its problems too.
[Diagram: Data passing through many ETL steps into the EDW and its Schema]

So what is the answer?

Enter the Hadoop.
………
http://www.fabulouslybroke.com/2011/05/ninja-elephants-and-other-awesome-stories/

Hadoop was created because Big IT never cut it for the Internet Properties like Google, Yahoo, Facebook, Twitter, and LinkedIn

Traditional architecture didn’t scale enough…
[Diagram: groups of Apps, each backed by DBs on a SAN]

Databases become bloated and useless

Traditional architectures cost too much at that volume…
[Diagram: $/TB, $pecial Hardware, $upercomputing]

How would you fix this?

If you could design a system that would handle this, what would it look like?

It would probably need a highly resilient, self-healing, cost-efficient, distributed file system…
[Diagram: a grid of Storage nodes]

It would probably need a completely parallel processing framework that took tasks to the data…
[Diagram: Processing paired with each Storage node]

It would probably run on commodity hardware, virtualized machines, and common OS platforms
[Diagram: the same grid of Processing and Storage nodes]

It would probably be open source so innovation could happen as quickly as possible

It would need a critical mass of users

It would be Apache Hadoop

{Processing + Storage} = {MapReduce/Tez/YARN + HDFS}

HDFS stores data in blocks and replicates those blocks
[Diagram: block1, block2, and block3, each stored as three copies spread across the Storage nodes]

If a block fails then HDFS always has the other copies and heals itself
[Diagram: one copy of block3 is lost (X) and a replacement copy appears on another Storage node]
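
The replicate-then-heal idea on the two slides above can be sketched in a few lines of Python. This is a toy model only, not HDFS code: the function names, the node names, the replication factor of 3, and the random placement are all assumptions made for illustration.

```python
import random

REPLICATION = 3  # assumed replication factor for this sketch


def place_blocks(blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes (hypothetical helper)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}


def fail_node(placement, dead_node):
    """Drop every replica that was held by the failed node."""
    for holders in placement.values():
        holders.discard(dead_node)


def heal(placement, live_nodes, replication=REPLICATION, seed=1):
    """Copy under-replicated blocks onto live nodes that do not hold them yet."""
    rng = random.Random(seed)
    for holders in placement.values():
        missing = replication - len(holders)
        if missing > 0:
            candidates = [n for n in live_nodes if n not in holders]
            holders.update(rng.sample(candidates, min(missing, len(candidates))))


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(["block1", "block2", "block3"], nodes)

fail_node(placement, "node3")                 # one storage node dies
live = [n for n in nodes if n != "node3"]
heal(placement, live)                         # surviving copies are re-replicated

for block, holders in sorted(placement.items()):
    print(block, sorted(holders))
```

After `heal` runs, every block is back to three copies and none of them sits on the failed node, which is the behaviour the slide describes.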

MapReduce is a programming paradigm that is completely parallel
[Diagram: Data split across five Mappers, whose output feeds three Reducers]

MapReduce has three phases: Map, Sort/Shuffle, Reduce
[Diagram: Key, Value pairs flow into Mappers, are sorted and shuffled by key, and are combined by Reducers into output Key, Value pairs]
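
The three phases can be illustrated with a word count in plain Python. This is a single-process sketch of the paradigm, not the Hadoop MapReduce API; the function names and the sample input lines are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Sort/Shuffle: group all values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could go to a different mapper.
lines = ["data begets data", "data likes to be together"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))
print(word_counts)
```

Each call to `map_phase` and each call to `reduce_phase` depends only on its own input, which is what lets a real cluster run them in parallel on different machines.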

MapReduce applies to a lot of data processing problems
[Diagram: Data split across five Mappers, whose output feeds three Reducers]

MapReduce goes a long way, but not all data processing and analytics are solved the same way

Sometimes your data application needs parallel processing and inter-process communication
[Diagram: Processes exchanging Data with one another]

…like Complex Event Processing in Apache Storm

Sometimes your machine learning data application needs to process in memory and iterate
[Diagram: Data passing through repeated Process steps]

…like Machine Learning in Spark

Introducing YARN

YARN = Yet Another Resource Negotiator

YARN abstracts resource management so you can run more than just MapReduce
[Diagram: MapReduce V2, MapReduce V?, Storm, Giraph, Tez, MPI, HBase, Spark, … and more, all running on YARN on top of HDFS2]

Resource Manager + Node Managers = YARN
[Diagram: a Resource Manager with its Scheduler, and several Node Managers hosting Containers and the AppMasters for Pig, Storm, and MapReduce]
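
To make the division of labour concrete, here is a toy sketch in which one resource manager hands out containers on node managers to several applications. It is not the YARN API: the class names, the memory figures, and the "most free memory first" placement rule are assumptions chosen for illustration, and real YARN schedulers are considerably more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker machine that hosts containers (hypothetical model)."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


@dataclass
class ResourceManager:
    """Cluster-wide arbiter that decides where each container runs."""
    nodes: list

    def allocate(self, app, memory_mb):
        """Grant a container on the node with the most free memory, or None."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None  # no node can satisfy the request right now
        node.free_mb -= memory_mb
        node.containers.append((app, memory_mb))
        return node.name


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])

# Different kinds of applications ask the same resource manager for containers.
grants = [
    rm.allocate("mapreduce", 2048),
    rm.allocate("storm", 2048),
    rm.allocate("pig", 2048),
    rm.allocate("pig", 4096),  # larger than anything left in the cluster
]
print(grants)
```

The point of the sketch is that the resource manager knows nothing about what MapReduce, Storm, or Pig do inside their containers; it only tracks and grants resources.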

YARN turns Hadoop into a smart phone: An App Ecosystem
hortonworks.com/yarn/

Check out the book too…
Preview at: hortonworks.com/yarn/

YARN is an essential part of a balanced breakfast in Hadoop 2.x

Introducing Tez

Tez is a YARN application, like MapReduce is a YARN application

Tez is the Lego set for your data application

Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.

Tez can chain tasks together into one job to get Map – Reduce – Reduce jobs suitable for things like Hive SQL projections, group by, and order by
[Diagram: Data flows through TezMap tasks and then through two chained stages of TezReduce tasks]
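
A Map – Reduce – Reduce chain of this kind can be sketched in plain Python as a projection, a group by, and an order by run back to back. This is not the Tez API: the stage functions, the `run_chain` helper, and the sample rows are hypothetical names and data used only to show the shape of the job.

```python
from collections import defaultdict


def tez_map(rows):
    """Map stage: project only the (customer, amount) columns."""
    return [(row["customer"], row["amount"]) for row in rows]


def tez_reduce_group(pairs):
    """First reduce stage: GROUP BY customer, SUM(amount)."""
    totals = defaultdict(int)
    for customer, amount in pairs:
        totals[customer] += amount
    return list(totals.items())


def tez_reduce_order(totals):
    """Second reduce stage: ORDER BY total DESC, fed directly by the first reduce."""
    return sorted(totals, key=lambda item: item[1], reverse=True)


def run_chain(data, stages):
    """Run all stages as one job, handing each output straight to the next stage."""
    for stage in stages:
        data = stage(data)
    return data


# Hypothetical input table.
rows = [
    {"customer": "ann", "amount": 5, "region": "east"},
    {"customer": "bob", "amount": 9, "region": "west"},
    {"customer": "ann", "amount": 7, "region": "west"},
]

result = run_chain(rows, [tez_map, tez_reduce_group, tez_reduce_order])
print(result)
```

With classic MapReduce the same query would typically need two separate jobs, each with its own map phase; chaining the second reduce directly onto the first is the saving the slide points at.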

Tez can provide long-running containers for applications like Hive to side-step batch process startups you would have with MapReduce

Hadoop has other open source projects…

Hive = {SQL -> Tez || MapReduce}
SQL-IN-HADOOP

Pig = {PigLatin -> Tez || MapReduce}

HCatalog = {metadata* for MapReduce, Hive, Pig, HBase}
*metadata = tables, columns, partitions, types

Oozie = Job::{Task, Task, if Task, then Task, final Task}

Falcon
[Diagram: Feeds flowing into a Hadoop cluster, with Feed Replication to a second Hadoop cluster for DR]

Knox
[Diagram: REST Clients reaching Hadoop Clusters through the Knox Gateway, alongside Enterprise LDAP]

Flume
[Diagram: Flume agents carrying Files, JMS, Weblogs, and Events into Hadoop]

Sqoop
[Diagram: Sqoop moving data between DBs and Hadoop]

Ambari = {install, manage, monitor}

HBase = {real-time, distributed-map, big-tables}

Storm = {Complex Event Processing, Near-Real-Time, Provisioned by YARN}

Apache Hadoop
[Diagram: HDFS, YARN, MapReduce, Tez, Storm, Pig, Hive, HCatalog, HBase, Ambari, Knox, Sqoop, Falcon, Flume]

Hortonworks Data Platform
[Diagram: the same components: HDFS, YARN, MapReduce, Tez, Storm, Pig, Hive, HCatalog, HBase, Ambari, Knox, Sqoop, Falcon, Flume]

What else are we working on?
hortonworks.com/labs/

Hadoop is the new Modern Data Architecture
