
Cloud Computing: Hadoop

Information about Cloud Computing: Hadoop
Technology

Published on November 20, 2008

Author: darugar

Source: slideshare.net

Description

Data Processing in the Cloud with Hadoop from Data Services World conference.

Data Processing in the Cloud
Parand Tony Darugar
http://parand.com/say/
[email_address]

What is Hadoop

Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.

Why?

A common infrastructure pattern extracted from building distributed systems

Scale

Incremental growth

Cost

Flexibility

Built-in Resilience to Failure

When dealing with large numbers of commodity servers, failure is a fact of life

Assume failure, build protections and recovery into your architecture

Data level redundancy

Job/Task level monitoring and automated restart and re-allocation
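The restart-and-recovery idea can be sketched in a few lines. This is an illustrative simulation, not Hadoop's API: the names `run_with_retries`, `MAX_ATTEMPTS`, and `flaky` are invented for the example.

```python
# Illustrative sketch: a task that may fail is re-run up to a fixed
# number of attempts, mimicking how the framework restarts failed
# tasks (on a real cluster, on a different node).
MAX_ATTEMPTS = 3

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    """Run `task` (a zero-arg callable), retrying on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: surface the failure to the job
            # In a real cluster the task would be re-allocated to
            # another node here; we simply retry.

# A task that fails twice before succeeding:
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated node failure")
    return "ok"

result = run_with_retries(flaky)  # succeeds on the third attempt
```

The key design point mirrored here: failure handling lives in the framework, not in user code, so the map/reduce functions themselves stay oblivious to node failures.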

Current State of Hadoop Project

Top-level Apache Software Foundation project

In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …

Large, active user base, mailing lists, user groups

Very active development, strong development team

Widely Adopted

A valuable and reusable skill set

Taught at major universities

Easier to hire for

Easier to train on

Portable across projects, groups

Plethora of Related Projects

Pig

Hive

HBase

Cascading

Hadoop on EC2

JAQL, X-Trace, Happy, Mahout

What is Hadoop

The Linux of distributed processing.

How Does Hadoop Work?

Hadoop File System

A distributed file system for large data

Your data in triplicate

Built-in redundancy, resiliency to large scale failures

Intelligent distribution, striping across racks

Accommodates very large data sizes

On commodity hardware
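The "data in triplicate" idea above can be sketched as splitting a file into fixed-size blocks and assigning each block three replicas on distinct racks. This is a toy illustration: the block size, rack names, and round-robin placement policy are simplified stand-ins for HDFS's real (megabyte-scale, topology-aware) behavior.

```python
# Illustrative sketch of HDFS-style storage: split data into
# fixed-size blocks, store each block in triplicate, and spread
# replicas across racks. Sizes and placement are simplified.
from itertools import cycle

BLOCK_SIZE = 8    # bytes here for illustration; HDFS blocks are MBs
REPLICATION = 3   # "your data in triplicate"

def split_into_blocks(data, size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, racks):
    """Assign each block REPLICATION replicas, round-robin over racks.

    With at least REPLICATION racks, consecutive picks from the
    cycle are distinct, so each block lands on 3 different racks.
    """
    assert len(racks) >= REPLICATION
    rack_iter = cycle(racks)
    return {i: [next(rack_iter) for _ in range(REPLICATION)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"a long file stored on the cluster")
layout = place_replicas(blocks, ["rack1", "rack2", "rack3", "rack4"])
```

Striping across racks is what lets the system survive not just a disk failure but the loss of a whole rack.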

Programming Model: Map/Reduce

Very simple programming model:

Map(anything)->key, value

Sort, partition on key

Reduce(key,value)->key, value

No parallel processing / message passing semantics

Programmable in Java or any other language (streaming)
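The model above is small enough to demonstrate in a single process. Below is the classic word-count example: `map_fn` emits (key, value) pairs, the pairs are grouped on the key, and `reduce_fn` sums per key. The function names are ours; on Hadoop the same two functions would run distributed, in Java or via streaming.

```python
# A minimal single-process sketch of Map/Reduce: word count.
from collections import defaultdict

def map_fn(line):                 # Map(anything) -> (key, value) pairs
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):       # Reduce(key, values) -> (key, value)
    return key, sum(values)

def local_mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map phase
        for key, value in map_fn(line):
            grouped[key].append(value)  # group on key (the shuffle)
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

counts = local_mapreduce(["the quick brown fox", "the lazy dog"])
```

Note what is absent: no threads, no message passing, no failure handling. That is the point of the model; the framework supplies all of it.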

Processing Model

Create or allocate a cluster

Put data onto the file system:

Data is split into blocks, stored in triplicate across your cluster

Run your job:

Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data

Move computation to data, not data to computation
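"Move computation to data" can be sketched as a scheduling choice: given where each block's replicas live, prefer a free node that already holds a copy of the task's input. The node names, block map, and `schedule` function here are invented for illustration and simplify the real scheduler considerably.

```python
# Illustrative data-local scheduling: assign each map task (one per
# input block) to a free node that already stores a replica of that
# block, falling back to any free node.
block_locations = {
    0: ["node1", "node2", "node3"],   # replicas of block 0
    1: ["node2", "node4", "node5"],
    2: ["node1", "node4", "node6"],
}

def schedule(block_locations, free_nodes):
    """Prefer a free node holding the block; else any free node."""
    assignment = {}
    free = set(free_nodes)
    for block, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if n in free]
        node = local[0] if local else sorted(free)[0]
        assignment[block] = node
        free.discard(node)
    return assignment

plan = schedule(block_locations, ["node1", "node2", "node4"])
# every block gets a node that already has its data
```

Shipping a small piece of code to a node is far cheaper than shipping gigabytes of blocks across the network, which is why locality drives the placement.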

Processing Model (continued)

Monitor workers, automatically restarting failed or slow tasks

Gather output of Map, sort and partition on key

Run Reduce tasks

Monitor workers, automatically restarting failed or slow tasks

Results of your job are now available on the Hadoop file system
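The "gather, sort, and partition on key" step above is the framework's shuffle. A sketch, with simplified stand-ins for Hadoop's internals: a hash partitioner routes each key to one reducer, each reducer's bucket is sorted, and the reduce function runs per key group.

```python
# Sketch of the shuffle: partition map output on key so each reducer
# sees a disjoint slice of the key space, sorted by key.
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    """Route a key to a reducer (hash partitioning, simplified)."""
    return hash(key) % num_reducers

def shuffle(map_output):
    """map_output: iterable of (key, value) pairs from all mappers."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in map_output:
        buckets[partition(key)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))   # sort on key within bucket
    return buckets

def run_reducers(buckets, reduce_fn):
    results = {}
    for bucket in buckets:
        for key, pairs in groupby(bucket, key=itemgetter(0)):
            results[key] = reduce_fn(key, [v for _, v in pairs])
    return results

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
totals = run_reducers(shuffle(pairs), lambda k, vals: sum(vals))
```

Because all pairs for a given key land in the same bucket, each reduce call sees every value for its key, regardless of which mapper produced it.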

Hadoop on the Grid

Managed Hadoop clusters

Shared resources, improved utilization

Standard data sets, storage

Shared, standardized operations management

Hosted internally or externally (e.g., on EC2)

Usage Patterns

ETL

Put a large data source (e.g., log files) onto the Hadoop File System

Perform aggregations, transformations, normalizations on the data

Load into RDBMS / data mart
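The ETL shape above fits in a few lines locally. This is an illustrative sketch: the log format, field names, and `transform` helper are invented for the example, and the output rows stand in for what you would bulk-load into an RDBMS.

```python
# Illustrative ETL: parse raw log lines, normalize fields, and
# aggregate into rows ready to load into a data mart.
from collections import Counter

raw_logs = [
    "2008-11-20 10:01:22 GET /index.html 200",
    "2008-11-20 10:01:23 GET /about.html 404",
    "2008-11-20 10:02:01 GET /index.html 200",
]

def transform(line):
    """Extract (date, status) from a log line; normalize types."""
    date, _time, _method, _path, status = line.split()
    return date, int(status)

# Aggregate: requests per (date, status) -- the kind of rollup
# you would load into an RDBMS or data mart.
rollup = Counter(transform(line) for line in raw_logs)
rows = [(date, status, count)
        for (date, status), count in sorted(rollup.items())]
```

At cluster scale, `transform` becomes the map function and the counting becomes the reduce, with the same logic spread over terabytes of logs.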

Reporting and Analytics

Run canned and ad-hoc queries over large data

Run analytics and data mining operations on large data

Produce reports for end-user consumption or loading into data mart

Data Processing Pipelines

Multi-step pipelines for data processing

Coordination, scheduling, data collection and publishing of feeds

SLA-carrying, regularly scheduled jobs

Machine Learning & Graph Algorithms

Traverse large graphs and data sets, building models and classifiers

Implement machine learning algorithms over massive data sets
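A graph computation phrased as Map/Reduce can be as simple as computing each node's out-degree from an edge list: map emits (source, 1) per edge, reduce sums per node. The example graph and function names are ours; larger algorithms (iterative PageRank, for instance) chain many such jobs.

```python
# Sketch of a graph computation in Map/Reduce form: out-degree per
# node from an edge list.
from collections import defaultdict

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

def map_edges(edge_list):
    """Map: emit (source_node, 1) for every edge."""
    for src, _dst in edge_list:
        yield src, 1

def reduce_sum(grouped):
    """Reduce: sum the emitted ones per node."""
    return {node: sum(ones) for node, ones in grouped.items()}

grouped = defaultdict(list)
for key, value in map_edges(edges):   # map + group on key
    grouped[key].append(value)

out_degree = reduce_sum(grouped)
```

The same shape extends to classifiers and model training: map over records to emit partial statistics, reduce to combine them into the model.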

General Back-End Processing

Implement significant portions of back-end, batch-oriented processing on the grid

General computation framework

Simplify back-end architecture

What Next?

Download Hadoop:

http://hadoop.apache.org/

Try it on your laptop

Try Pig

http://hadoop.apache.org/pig/

Deploy to multiple boxes

Try it on EC2
