OCF.tw's talk about "Introduction to Spark"

Technology

Published on September 26, 2014

Author: thegiivee

Source: slideshare.net

Description

Sharing an introduction to Spark at the invitation of OCF and OSSF.

If you are interested in the Open Culture Foundation (財團法人開放文化基金會, OCF) or the Open Source Software Foundry (自由軟體鑄造場, OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/

Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
feel free to contact CLBC: http://clbc.tw/

Introduction to Spark. Wisely Chen (aka thegiive), Sr. Engineer at Yahoo

Agenda • What is Spark? (Easy) • Spark Concept (Medium) • Break: 10 min • Spark Ecosystem (Easy) • Spark Future (Medium) • Q&A

Who am I? • Wisely Chen (thegiive@gmail.com) • Sr. Engineer on the Yahoo! [Taiwan] data team • Loves to promote open source tech • Hadoop Summit 2013 San Jose • Jenkins Conf 2013 Palo Alto • Spark Summit 2014 San Francisco • COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012

[Diagram] Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning

Forecast Recommendation

HADOOP

Opinion from Cloudera • The leading candidate for "successor to MapReduce" today is Apache Spark • No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason. • From http://0rz.tw/y3OfM

What is Spark • From the UC Berkeley AMP Lab • The most active big data open source project since Hadoop

Community

Where is Spark?

[Diagram] Hadoop 2.0: MapReduce on YARN on HDFS, alongside Storm, HBase, and others

Hadoop Architecture: Hive (SQL), MapReduce (computing engine), YARN (resource management), HDFS (storage)

Hadoop vs Spark: Hive maps to Shark/SparkSQL, and MapReduce maps to Spark; both stacks sit on YARN and HDFS

Spark vs Hadoop • Spark runs on YARN, Mesos, or in standalone mode • Spark's main concept is based on MapReduce • Spark can read from • HDFS: data locality • HBase • Cassandra
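To make the deployment options above concrete, here is a minimal sketch in Spark 1.x-era Scala; the master URLs and HDFS path are illustrative placeholders, not from the talk:

  import org.apache.spark.{SparkConf, SparkContext}

  // The same job can target standalone, YARN, or Mesos just by
  // changing the master URL.
  val conf = new SparkConf()
    .setAppName("read-from-hdfs")
    .setMaster("spark://master:7077")  // or "yarn-client", "mesos://host:5050", "local[*]"
  val sc = new SparkContext(conf)

  // Reading from HDFS lets Spark schedule tasks near the blocks (data locality).
  val lines = sc.textFile("hdfs://namenode:9000/logs/a.txt")
  println(lines.count())

  sc.stop()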

More than MapReduce: Shark (Hive), GraphX (Pregel), MLlib (Mahout), Streaming (Storm), all built on Spark Core (MapReduce), over HDFS and a resource management system (YARN, Mesos)

Why Spark?

天下武功,無堅不破,惟快不破 ("In all the martial arts under heaven, no defense is unbreakable; speed alone cannot be beaten.")

[Chart] Running time (s): Spark is roughly 3x~25x faster than the MapReduce framework on logistic regression, KMeans, and PageRank. From Matei's paper: http://0rz.tw/VVqgP

What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing

Language Support • Python • Java • Scala

Python Word Count

  file = spark.textFile("hdfs://...")
  counts = file.flatMap(lambda line: line.split(" ")) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda a, b: a + b)
  counts.saveAsTextFile("hdfs://...")

Access data via the Spark API; process it with plain Python.

What is Spark • Apache Spark™ is a very fast and general engine for large-scale data processing

Why is Spark so fast?

Most machine learning algorithms need iterative computing
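As a hedged sketch of what "iterative" means here, consider a toy gradient-descent loop in Scala (the path and names are illustrative): the training points are loaded once, cached, and then re-scanned on every iteration.

  // Parse "x,y" lines into (feature, label) pairs and keep them in RAM.
  val points = sc.textFile("hdfs://.../points.txt")
    .map { line =>
      val Array(x, y) = line.split(",").map(_.toDouble)
      (x, y)
    }
    .cache()  // without this, every iteration would re-read the input

  var w = 0.0  // model weight, updated each iteration
  for (i <- 1 to 10) {
    // Each pass over `points` hits the in-memory cache, not the file system.
    val gradient = points.map { case (x, y) => (w * x - y) * x }.mean()
    w -= 0.1 * gradient
  }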

[Diagram] PageRank on a four-page graph (a, b, c, d): every rank starts at 1.0, and each iteration (1st, 2nd, 3rd) recomputes the ranks from the previous iteration's temporary results as the values converge.

HDFS is 100x slower than memory. MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → ... → Iter N. Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → ... → Iter N.

PageRank over 1 billion URL records: the first iteration (HDFS) takes 200 sec; the 2nd iteration (mem) takes 7.4 sec; the 3rd iteration (mem) takes 7.7 sec.

Spark Concept

Map Reduce Shuffle
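A shuffle is what happens when records must be regrouped across partitions. As an illustrative Scala fragment (paths elided as on the other slides): the map side stays local to each partition, while reduceByKey must bring equal keys together over the network.

  val pairs = sc.textFile("hdfs://...")   // reading and mapping stay local to each partition
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))

  val counts = pairs.reduceByKey(_ + _)   // shuffle: identical words must meet on one node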

DAG Engine

RDD • Resilient Distributed Dataset • Collections of objects spread across a cluster, stored in RAM or on disk • Built through parallel transformations
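A small sketch of these three properties, using the Spark 1.x Scala API (the numbers are arbitrary):

  import org.apache.spark.storage.StorageLevel

  val nums = sc.parallelize(1 to 1000000, 8)     // a collection spread across the cluster (8 partitions)
  val squares = nums.map(n => n.toLong * n)      // built through a parallel transformation

  squares.persist(StorageLevel.MEMORY_AND_DISK)  // stored in RAM, spilling to disk if needed
  println(squares.count())                       // lost partitions are recomputed from lineage (resilient)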

Fault Tolerance 天下武功,無堅不破,惟快不破 ("no defense is unbreakable; speed alone cannot be beaten")

RDD

  val a = sc.textFile("hdfs://....")                  // RDD a
  val b = a.filter( line => line.contains("Spark") )  // RDD b: Transformation
  val c = b.count()                                   // Value c: Action
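One consequence worth spelling out, sketched here with illustrative code: transformations are lazy, so every action re-runs the whole lineage unless the RDD is cached, which the log-mining slides below exploit.

  val b = sc.textFile("hdfs://....").filter(line => line.contains("Spark"))

  b.count()  // action #1: reads the file and filters
  b.count()  // action #2: reads and filters again, since b was not cached

  b.cache()
  b.count()  // fills the cache while computing
  b.count()  // now served from memory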

Log mining

  val a = sc.textFile("hdfs://aaa.com/a.txt")
  val err = a.filter( t => t.contains("ERROR") )
             .filter( t => t.contains("2014") )
  err.cache()
  err.count()
  val m = err.filter( t => t.contains("MYSQL") ).count()
  val apache = err.filter( t => t.contains("APACHE") ).count()

Step by step on the cluster:
1. The driver ships tasks to the workers.
2. Each worker reads one block of the file (Block1, Block2, Block3) as its partition of RDD a.
3. The filters produce RDD err on every worker.
4. err.cache() marks err to be kept in memory; the first action, err.count(), fills the per-worker caches (Cache1, Cache2, Cache3).
5. The MYSQL count (RDD m) and the APACHE count then read the cached partitions instead of going back to HDFS.

RDD Cache: the 1st iteration (no cache) takes the same time as before; once cached, an iteration takes 7 sec.

RDD Cache • Data locality • Cache. A self join over 5 billion records is a big shuffle that takes 20 min; after caching, it takes only 265 ms.
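As an illustrative sketch of that self-join scenario (the record layout and path are assumptions, not from the talk): caching the keyed records means the expensive scan and parse happen only once, although the join itself still shuffles.

  val records = sc.textFile("hdfs://.../records.txt")
    .map(line => (line.split("\t")(0), line))  // key each record by its first column (assumed layout)
    .cache()

  val selfJoined = records.join(records)       // the big shuffle, but its input now comes from memory
  println(selfJoined.count())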

Scala Word Count

  val file = spark.textFile("hdfs://...")
  val counts = file.flatMap(line => line.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://...")

Step by Step • file.flatMap(line => line.split(" ")) => (aaa, bb, cc) • .map(word => (word, 1)) => ((aaa,1), (bb,1), ...) • .reduceByKey(_ + _) => ((aaa,123), (bb,23), ...)

Java Word Count

  import java.util.Arrays;
  import scala.Tuple2;
  import org.apache.spark.api.java.*;
  import org.apache.spark.api.java.function.*;

  JavaRDD<String> file = spark.textFile("hdfs://...");
  JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
  });
  JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
  });
  JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
  });
  counts.saveAsTextFile("hdfs://...");

Java vs Scala • Scala: file.flatMap(line => line.split(" ")) • Java version:

  JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

Python

  file = spark.textFile("hdfs://...")
  counts = file.flatMap(lambda line: line.split(" ")) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda a, b: a + b)
  counts.saveAsTextFile("hdfs://...")

Highly Recommended • Scala: latest API features, stable • Python: a very familiar language, with native libraries such as NumPy and SciPy

How to use it? • 1. Go to https://spark.apache.org/ • 2. Download and unzip it • 3. Run ./sbin/start-all.sh (to start a standalone cluster) or ./bin/spark-shell
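A first session in the shell might look like this (illustrative; spark-shell creates the SparkContext `sc` for you, and README.md ships with the Spark distribution):

  scala> val file = sc.textFile("README.md")
  scala> file.filter(line => line.contains("Spark")).count()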

DEMO

EcoSystem/Future

Hadoop Ecosystem

Spark Ecosystem: SparkSQL (Hive), GraphX (Pregel), MLlib (Mahout), Streaming (Storm), all built on Spark Core (MapReduce), over HDFS and a resource management system (YARN, Mesos)

Unified Platform

[Diagram] Detail: Streaming, BI, and ETL workloads run on Spark (SparkSQL, MLlib) over data in Hive, HDFS, Cassandra, and RDBMSs.

Complexity

Performance

Write once, run any use case

BI (SparkSQL), Streaming (Spark Streaming), Machine Learning (MLlib): all on Spark

Spark bridges people together

Data Analyst Data Engineer Data Scientist

Bridge people together • Scala: Engineer • Java: Engineer • Python: Data Scientist, Engineer • R: Data Scientist, Data Analyst • SQL: Data Analyst

[Diagram] Yahoo EC team data platform: traffic data and transaction data land as filtered data in HDFS; via Shark they feed the data mart (Oracle), ML models (Spark), and BI reports (MSTR).

Data Analyst

Data Analyst: 350 TB of data • Select tweet from tweets_data where similarity(tweet, "FIFA") > 0.01 • SQL + machine learning • http://youtu.be/lO7LhVZrNwA?list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr • https://www.youtube.com/watch?v=lO7LhVZrNwA&list=PL-x35fyliRwiST9gF7Z8Nu3LgJDFRuwfr#t=2900

Data Scientist http://goo.gl/q5CAx8 http://research.janelia.org/zebrafish/

SQL (Data Analyst), Cloud Computing (Data Engineer), Machine Learning (Data Scientist): they all meet in Spark

Databricks Cloud DEMO

BI (SparkSQL), Streaming (Spark Streaming), Machine Learning (MLlib): all on Spark

Instant BI Report http://youtu.be/dJQ5lV5Tldw?t=30m30s

BI (SparkSQL), Streaming (Spark Streaming), Machine Learning (MLlib): all on Spark

Background Knowledge • Real-time tweet data is stored in a SQL database • Spark MLlib trains a TF-IDF model on Wikipedia data • SparkSQL selects tweets and filters them by TF-IDF similarity • A live BI report is generated

Code

  val wiki = sql("select text from wiki")
  val model = new TFIDF()
  model.train(wiki)
  registerFunction("similarity", model.similarity _)

  select tweet from tweet where similarity(tweet, "$search") > 0.01

DEMO http://youtu.be/dJQ5lV5Tldw?t=39m30s

Q & A
