MapReduce and Hadoop

Information about MapReduce and Hadoop
Technology

Published on March 15, 2014

Author: NicolaCadenelli

Source: slideshare.net

Description

An introduction to MR and Hadoop, and a view on the opportunities to use MR with databases, i.e., SQL-MapReduce by Teradata and In-database MR by Oracle.

The presentation was used during a class of Datenbanken Implementierungstechniken in 2013.

MapReduce and Hadoop
Cadenelli Nicola
Datenbanken Implementierungstechniken

Outline
Introduction
● History
● Motivations
MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions
Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In real world
MapReduce&Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

[Diagram: GFS, MapReduce and BigTable (Google) alongside HDFS and MapReduce (Hadoop)]

A Brief History
● 2003-2004: Google publishes the GFS (2003) and MapReduce (2004) papers.
● 2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.
● Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.

Motivations
● Data sets are becoming really big: more than 10 TB. E.g., the Large Synoptic Survey Telescope (30 TB / night).
● The best idea is to scale out (not scale up) the system, but . . .
  - How do we scale to 1000+ machines?
  - How do we handle machine failures?
  - How can we facilitate communication between nodes?
  - If we change system, do we lose all our optimisation work?
● Google needed to recreate the index of the web.

What is MapReduce?
“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc. MapReduce paper, 2004.
It is a really simple API that has just two serial functions, map() and reduce(), and is language independent (Java, Python, Perl …).

Why is MapReduce useful?
MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates
All users benefit from improvements to the core library.

Typical problem solved by MapReduce
1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from outside, every job looks the same (read, process, write); only map and reduce change to fit the problem.
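
The five steps above can be sketched in a single process as follows. This is only an illustration of the flow, not how Hadoop implements it: the names `run_job`, `map_fn` and `reduce_fn` and the station readings are invented for the example.

```python
from collections import defaultdict

# Hypothetical input records (station, temperature), invented for illustration.
RECORDS = [("A", 12), ("B", 7), ("A", 19), ("C", 3), ("B", 11)]


def map_fn(record):
    """Step 2: extract the (key, value) pair we care about from one record."""
    station, temperature = record
    yield station, temperature


def reduce_fn(key, values):
    """Step 4: aggregate all values that share a key (here: the maximum)."""
    return key, max(values)


def run_job(records, map_fn, reduce_fn):
    # Step 1: read the input (here simply an in-memory list).
    # Step 2: map every record to intermediate (key, value) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Step 3: shuffle and sort, i.e. group the values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Step 4: reduce each group in key order.
    # Step 5: "write" the results by returning them.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


result = run_job(RECORDS, map_fn, reduce_fn)
print(result)
```

Swapping `map_fn` and `reduce_fn` for other functions changes the problem being solved while `run_job` stays untouched, which is the point the slide makes.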

Some Execution Details
● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block → minimizes network usage!
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes, it is re-executed.
● A job can have more mappers and reducers than nodes.
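
Re-execution works as a recovery strategy because map and reduce tasks are expected to be deterministic, so running one again yields the same output. A minimal sketch of that idea, assuming a hypothetical helper `run_with_reexecution` and a made-up task that crashes once (a real master would reschedule the task, possibly on another node):

```python
def run_with_reexecution(task, task_input, max_attempts=3):
    """Run a deterministic task, re-executing it when an attempt crashes."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except RuntimeError:
            # Simulated crash: simply try the same task again.
            continue
    raise RuntimeError(f"task failed {max_attempts} times")


class FlakyMapTask:
    """Hypothetical map task that crashes on its first attempt only."""

    def __init__(self):
        self.calls = 0

    def __call__(self, line):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated worker crash")
        return [(word.lower(), 1) for word in line.split()]


output, attempts = run_with_reexecution(FlakyMapTask(), "To be")
print(output, attempts)
```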

Execution overview
[Figure: execution overview, from the Google, Inc. MapReduce paper, 2004]

MapReduce Programming Model
The programmer has to write two primary methods:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)
● All v2 with the same k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.
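
The two signatures can be written down as Python type aliases to make the domains explicit. This is a sketch under assumptions: the alias names `Mapper` and `Reducer` and the two example functions are invented here. The input domain is (document name, text) while the output domain is (word length, count).

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map (k1, v1) -> list(k2, v2)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(k2, v2)
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K2, V2]]]


def length_mapper(name: str, text: str) -> Iterable[Tuple[int, int]]:
    """k1 = document name, v1 = text; emits k2 = word length, v2 = 1."""
    for word in text.split():
        yield len(word), 1


def sum_reducer(length: int, counts: List[int]) -> Iterable[Tuple[int, int]]:
    """Receives every count emitted for one word length and sums them."""
    yield length, sum(counts)


mapper: Mapper = length_mapper
reducer: Reducer = sum_reducer

mapped = list(mapper("doc", "to be or not"))
reduced = list(reducer(2, [1, 1, 1]))
print(mapped, reduced)
```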

Example: Words Frequency

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

Input: “documentx”, “To be or not to be”
Output: “be”, 2  “not”, 1  “or”, 1  “to”, 2
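
A runnable Python rendering of the pseudocode above, as a sketch rather than Hadoop code. The function names are invented, and lowercasing the words is an assumption made so that "To" and "to" are counted together, as the example output implies.

```python
import re
from collections import defaultdict


def map_words(doc_name, contents):
    """Emit (word, "1") for every word in the document contents."""
    for word in re.findall(r"[a-z']+", contents.lower()):
        yield word, "1"


def reduce_counts(word, counts):
    """Sum the string counts collected for one word."""
    return word, sum(int(count) for count in counts)


def word_frequency(documents):
    # Group the intermediate pairs by word (the shuffle step).
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_words(name, contents):
            grouped[word].append(count)
    # Reduce each group, visiting the words in sorted order.
    return dict(reduce_counts(word, grouped[word]) for word in sorted(grouped))


frequencies = word_frequency({"documentx": "To be or not to be"})
print(frequencies)
```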

[Diagram: data flow of the words frequency example]
Input: “document1”, “To be or not to be”; “document2”, “text”; ...
Map: “to”, 1  “be”, 1  “or”, 1  “not”, 1  “to”, 1  “be”, 1  ...
Shuffle and Sort (aggregate values by key): “be”, 1  “be”, 1  ...  “not”, 1  ...  “or”, 1  ...  “to”, 1  “to”, 1  ...
Reduce: key = “be” values = “1”, “1”; key = “not” values = “1”; key = “or” values = “1”; key = “to” values = “1”, “1”; ...
Output: “be”, 2  “not”, 1  “or”, 1  “to”, 2  ...
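
The shuffle-and-sort stage of this flow can be imitated with a sort followed by `itertools.groupby`. The sketch below only illustrates the grouping semantics on the map output of "To be or not to be"; in a real cluster this stage moves data between machines. The name `shuffle_and_sort` is made up.

```python
from itertools import groupby
from operator import itemgetter

# Map output for "To be or not to be", in emission order.
map_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]


def shuffle_and_sort(pairs):
    """Sort pairs by key, then collect the values of each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


grouped = shuffle_and_sort(map_output)
print(grouped)
```

Each `(key, values)` pair in `grouped` is what a single reduce call would receive.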

Other examples
● Inverted index
  - Find which documents contain a specific word.
  - Map: parse document, emit <word, document-ID> pairs.
  - Reduce: for each word, sort the corresponding document IDs. Emit <word, list(document-ID)>
● Reverse web-link graph
  - Find where page links come from.
  - Map: output <target, source> for each link to target in a page source.
  - Reduce: concatenate the list of all source URLs associated with a target. Emit <target, list(source)>
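
Both examples follow the same pattern: emit pairs, group by key, post-process each group. A small in-memory sketch, with invented function names and toy data. Sorting the source URLs in `reverse_links` is an addition for deterministic output, whereas the slide only asks for concatenation.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of each key, standing in for shuffle and sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def inverted_index(documents):
    # Map: emit <word, document-ID> once per distinct word in a document.
    pairs = (
        (word, doc_id)
        for doc_id, text in documents.items()
        for word in set(text.lower().split())
    )
    # Reduce: sort the document IDs collected for each word.
    return {word: sorted(ids) for word, ids in group_by_key(pairs).items()}


def reverse_links(pages):
    # Map: emit <target, source> for each link found in a source page.
    pairs = (
        (target, source)
        for source, targets in pages.items()
        for target in targets
    )
    # Reduce: gather all sources that point at each target.
    return {target: sorted(srcs) for target, srcs in group_by_key(pairs).items()}


index = inverted_index({1: "to be or not", 2: "not now"})
links = reverse_links({"a.html": ["b.html", "c.html"], "b.html": ["c.html"]})
print(index["not"], links)
```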

Conclusions on MapReduce
● Proven to be a useful abstraction
● Really simplifies large-scale computations
● Fun to use:
  - Focus on the problem
  - Let the library deal with the messy details

[Diagram: GFS and MapReduce (Google) alongside HDFS and MapReduce (Hadoop)]

What is Hadoop?
● It is a framework for distributed processing
● It is Open Source (Apache v2 Licence)
● It is a top-level Apache Project
● Written in Java
● Batch processing centric
● Runs on commodity hardware

Hadoop Distributed File System
● For very large files: TBs, PBs.
● Each file is partitioned into chunks of 64 MB.
● Each chunk is replicated several times (3 by default), on different racks, for fault tolerance.
● It is an abstract FS layered on the local file systems: disks are formatted with ext3, ext4 or XFS.
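
A quick back-of-the-envelope calculation shows what these numbers mean for a single file. The helper `hdfs_footprint` is hypothetical; it assumes 64 MB chunks, a replication factor of 3, and that a partially filled last chunk only occupies the bytes it actually holds.

```python
import math

CHUNK_SIZE_MB = 64   # chunk size stated on the slide
REPLICATION = 3      # assumed replication factor


def hdfs_footprint(file_size_mb, chunk_size_mb=CHUNK_SIZE_MB, replication=REPLICATION):
    """Return (chunks, chunk replicas, raw storage in MB) for one file."""
    chunks = math.ceil(file_size_mb / chunk_size_mb)
    replicas = chunks * replication
    raw_storage_mb = file_size_mb * replication
    return chunks, replicas, raw_storage_mb


# A 1 TB file expressed in MB.
chunks, replicas, raw_mb = hdfs_footprint(1024 * 1024)
print(chunks, replicas, raw_mb)
```

Under these assumptions a 1 TB file is split into 16384 chunks, stored as 49152 chunk replicas, and consumes 3 TB of raw disk across the cluster.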

Hadoop Architecture
● TaskTracker is the MapReduce server (processing part)
● DataNode is the HDFS server (data part)
[Diagram: one machine running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave
JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status
[Diagram: one JobTracker coordinating several machines, each running a TaskTracker and a DataNode]

Hadoop Architecture - Master/Slave
NameNode:
● Keeps information on data location
● Decides where a file has to be written
Data never flows through the NameNode!
[Diagram: one NameNode coordinating several machines, each running a TaskTracker and a DataNode]
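
The division of labour between NameNode and DataNodes can be illustrated with a toy model: the NameNode answers "where are the blocks?", and the client then fetches the bytes from the DataNodes itself. All class and method names below are invented and bear no relation to the real HDFS API. Block placement is hard-coded here, whereas in HDFS the NameNode decides it.

```python
class DataNode:
    """Stores block contents; the only place file data lives."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks form a file and where they are."""

    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [DataNode names])

    def register(self, filename, block_id, node_names):
        self.block_map.setdefault(filename, []).append((block_id, node_names))

    def locate(self, filename):
        return self.block_map[filename]


nodes = {name: DataNode(name) for name in ("dn1", "dn2")}
namenode = NameNode()

# Write two blocks directly to DataNodes and record their location.
for block_id, (data, holder) in enumerate([("hello ", "dn1"), ("world", "dn2")]):
    nodes[holder].write(block_id, data)
    namenode.register("greeting.txt", block_id, [holder])


def read_file(filename):
    """Ask the NameNode for locations, then read from the DataNodes."""
    locations = namenode.locate(filename)
    return "".join(nodes[names[0]].read(block_id) for block_id, names in locations)


contents = read_file("greeting.txt")
print(contents)
```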

Hadoop Architecture - Scalable
● Having multiple machines with Hadoop creates a cluster.
● What if we need more storage or compute power?
[Diagram: additional machines, each running a TaskTracker and a DataNode, joining the cluster]

Hadoop Architecture - Overview
[Diagram: a Client, the JobTracker, the NameNode and the Secondary NameNode, with a file split into blocks A, B and C]

Hadoop Ecosystem - Pig & Hive
[Diagram: Pig and Hive on top of MapReduce, on top of HDFS]

Hadoop Ecosystem - HBase
[Diagram: the same stack (HDFS, MapReduce, Pig, Hive) with HBase added]

What is MapReduce/Hadoop used for?
@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation
@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection

Most of the data in companies is stored in databases, but . . .
. . . MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by features like B-trees and hash partitioning.

Solutions
● SQL-MapReduce by Teradata Aster
● In-Database Map-Reduce by Oracle
● Connectors to allow external Hadoop programs to access data from databases and to store Hadoop output in databases

SQL-MapReduce
It is a framework that lets developers write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.

SQL-MapReduce - Syntax
MR functions can be used like custom SQL operators and can implement any algorithm or transformation.
http://www.asterdata.com/resources/mapreduce.php

Example: Words Frequency

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR

    SELECT key AS word, value AS wordcount
    FROM WordCountReduce (
        ON Tokenize ( ON blogs )
        PARTITION BY key
    )
    ORDER BY wordcount DESC
    LIMIT 20;

Demo #2: Why do Reduce when we have SQL?

    SELECT word, count(*) AS wordcount
    FROM Tokenize ( ON blogs )
    GROUP BY word
    ORDER BY wordcount DESC
    LIMIT 20;
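
The point of Demo #2 is that, once tokenization is done, plain SQL aggregation can play the role of the reduce step. This can be tried without an Aster installation using Python's built-in `sqlite3` module. In the sketch below the `tokens` table is a hypothetical stand-in for the output of `Tokenize ( ON blogs )`, and the extra `word` sort key is added only to make ties deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")

# Stand-in for the map step: one row per token of a sample sentence.
conn.executemany(
    "INSERT INTO tokens VALUES (?)",
    [(word,) for word in "to be or not to be".split()],
)

# The reduce step expressed as an ordinary GROUP BY.
rows = conn.execute(
    "SELECT word, count(*) AS wordcount "
    "FROM tokens "
    "GROUP BY word "
    "ORDER BY wordcount DESC, word "
    "LIMIT 20"
).fetchall()
conn.close()

print(rows)
```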

In-Database Map-Reduce by Oracle
● Uses Table Functions to implement Map-Reduce within the database.
● Parallelization is provided by the Oracle Parallel Execution framework.
Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.

Example: Words Frequency

    SELECT *
    FROM table(oracle_map_reduce.reducer(
           cursor(
             SELECT value(map_result).word word
             FROM table(oracle_map_reduce.mapper(
                    cursor(SELECT a FROM documents), ' '
                  )) map_result
           )
         ));

Still not perfect!
However, these solutions are not source compatible with Hadoop. Native Hadoop programs need to be rewritten before becoming usable in databases.

Questions?
