advertisement

Streaming Distributed Data Processing with Silk #deim2014

50 %
50 %
advertisement
Information about Streaming Distributed Data Processing with Silk #deim2014
Technology

Published on March 3, 2014

Author: taroleo

Source: slideshare.net

Description

A framework written in Scala for describing distributed data processing programs.
advertisement

Streaming Distributed Data Processing with Silk Taro L. Saito University of Tokyo leo@xerial.org March 3rd, 2014 DEIM2014 xerial.org/silk Twitter @taroleo 1

Distributed Data Processing Streaming Distributed Data Processing with Silk  Translate this data processing program A  g f B C into a cluster computing program g f A0 B0 A1 B1 A2 B2 map xerial.org/silk Twitter @taroleo C reduce 2

Streaming Distributed Data Processing Streaming Distributed Data Processing with Silk  What is streaming? A f g B F C G D  E Silk: A framework for building and running complex workflows of distributed data processing xerial.org/silk Twitter @taroleo 3

Problem Definition Streaming Distributed Data Processing with Silk  How do we run the distributed data processing while extending the program? A f g B F C G D xerial.org/silk Twitter @taroleo E 4

Silk Streaming Distributed Data Processing with Silk  Describing Dataflows in Scala  A dataflow in Silk is a sequence of function calls   Type safe and concise syntax, easy to learn. Silk[A] : Set of type A object xerial.org/silk Twitter @taroleo 5

Object-Oriented Dataflow Programming Streaming Distributed Data Processing with Silk  Reusing and overriding dataflow programs xerial.org/silk Twitter @taroleo 6

Big Data Volumes in Human Genome Analysis Streaming Distributed Data Processing with Silk Input: FASTQ file(s) 500GB (50x coverage, 200 million entries)  DNA Sequencer (Illumina, PacBio, etc.)    f: An alignment program Output: Alignment results 750GB (sequence + alignment data)   Total storage space required: 1.2TB Computational time required: 1 days (using hundreds of CPUs) Input f Output University of Tokyo Genome Browser (UTGB) xerial.org/silk Twitter @taroleo 7

Varieties of Scientific Data and Analysis Streaming Distributed Data Processing with Silk  WormTSS: http://wormtss.utgenome.org/  Integrating various data sources, hundreds of data analysis… xerial.org/silk Twitter @taroleo 8

Produced Thousands of Data Analysis Charts Streaming Distributed Data Processing with Silk Using R, JFreeChart, etc. Need a automated pipeline to redo the entire analysis for answering the paper review within a month. xerial.org/silk Twitter @taroleo 9

Writing A Dataflow Streaming Distributed Data Processing with Silk a Program v1 f A B val B = A.map(f)  Apply function f to the input A, then produce the output B  This step may take more than 1 hours in big data analysis xerial.org/silk Twitter @taroleo 10

Distribution and Fault Tolerance Streaming Distributed Data Processing with Silk  Resume only B2 = A2.map(f) a Program v1 f A B f A0 B0 A1 B1 A2 B2 Failure! xerial.org/silk Twitter @taroleo Retry 11

Extending Dataflows Streaming Distributed Data Processing with Silk Program v2 Program v1 A   f g B C While running program v1, adding another code (program v2) How do we reuse the already computed result (B) to generate C? xerial.org/silk Twitter @taroleo 12

Marking to A Program Streaming Distributed Data Processing with Silk Program v2 Program v1 A f g B C val B = A.map(f) val C = B.map(g)  Storing intermediate results using variable names  variable names := program markers!!  But, we lost variable names after compilation  Extracting AST and variable names upon compile time  Using Scala Macros (Since Scala 2.10) xerial.org/silk Twitter @taroleo 13

Scala Program (AST) to DAG Schedule (Logical Plan) Streaming Distributed Data Processing with Silk Program v2 Program v1 A  g B C Translating a program (AST) into a set of Silk operations (DAG)    f val B = MapOp(input:A, output:B, function:f) val C = MapOp(input:B, output:C, function:g) Operations in Silk can be nested  val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g) xerial.org/silk Twitter @taroleo 14

Weaving Silks Streaming Distributed Data Processing with Silk In-memory weaver Cluster weaver Result Hadoop weaver Silk[A] (operation DAG)  Weave Output Data analysis code is independent from weavers xerial.org/silk Twitter @taroleo 15

Cluster Weaver: Logical Plan to Physical Plan on Cluster Streaming Distributed Data Processing with Silk  Logical plan   GroupByOp(in:people, out:g, key: {_.dept.id}) Physical plan P1 Partition (hashing) S1 P1 S3 S1 P1 S1 S2 P2 P2 S2 S2 P2 S3 S2 P2 S1 S3 P3 P2 S2 S3 P3 P3 Scatter S2 P1 I3 P2 P3 I2 P1 P1 Silk[people] S1 P3 I1 S1 S3 S3 P3 serialization shuffle deserialization xerial.org/silk Twitter @taroleo R1 R2 R3 merge sort 16

Local machine Local ClassBox User program builds workflows Weaving Silk materializes objects classpaths & local jar files • • • • • Silk[A] Silk[A] read file, toSilk map, reduce, join, groupBy UNIX commands etc. SilkSingle[A] SilkSeq[A] weave weave Static optimization A DAG Schedule single object • • Cluster • • • • Dispatches tasks to clients Manages master resource table Authorizes resource allocation Automatic recovery by leader election in ZK Register ClassBox Submit schedule ZooKeeper ensemble mode (at least 3 ZK instances) Silk Master • • dispatch • • Silk Client Silk Client Task Scheduler Task Scheduler Task Executor Task Executor Resource Monitor Resource Monitor Data Server Data Server Leader election Collects locations of slices and ClassBox jars Watches active nodes Watches available resources Seq[A] sequence of objects Node Table Slice Table Task Status Resource Table (CPU, memory) ClassBox Table • • • • • • • • Submits tasks Run-time optimization Resource allocation Monitoring resource usage Launches Web UI Manages assigned task status Object serialization/deserialization Serves slice data xerial.org/silk Twitter @taroleo 17

Static Optimization Streaming Distributed Data Processing with Silk  Tree transformation     map(f).map(g) => map(g・f) (Function composition) map(f).filter(p) => mapWithFilter(f, p) (Reduces intermediate data) Pushing-down selection Retrieves only accessed fields in an object   Analyzing the byte code of functions with ASM Rewriting logical plans using pattern matching in Scala  Easy to add optimization rules xerial.org/silk Twitter @taroleo 18

Run-time Optimization Streaming Distributed Data Processing with Silk  Adjusting the number of data splits  According to the available cluster resources.  Multi-core execution  Omega-based task scheduler  Sharing the cluster resource table between nodes   Each node determines how to use the resource Monitoring actual CPU/memory resources periodically xerial.org/silk Twitter @taroleo 19

UNIX Command Workflows in Silk Streaming Distributed Data Processing with Silk  c”(UNIX Command)” xerial.org/silk Twitter @taroleo 20

Buffer Management Streaming Distributed Data Processing with Silk   Silk frequently uses distributed memory (like Spark) LArray[A]     Immediate memory deallocation (free)   To eliminate OutOfMemoryException and GC-stall Fast memory allocation   Allocating Off-heap (outside JVM heap)memories sun.misc.Unsafe Github: https://github.com/xerial/larray Skips zero-filling Object Serialization  Extending msgpack    Scala Pickling Inject ser/dser codes Off-heap objects xerial.org/silk Twitter @taroleo 21

Summary Streaming Distributed Data Processing with Silk  Silk  A framework for distributed data processing for all data scientists   Object-oriented data processing programming   Similar to query optimization in DBMS Analyze Data as You Write Programs!   Reuse, override and mix-in Optimizing data flow programs   including non-experts in distributed data processing (e.g. Biologists) Database research now enters program optimization. In Future  Workflow queries    Making queries against dataflow program Monitoring intermediate results Multi-user program execution xerial.org/silk Twitter @taroleo 22

http://xerial.org/silk Streaming Distributed Data Processing with Silk xerial.org/silk Twitter @taroleo 23

Add a comment

Related presentations

Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...

In this presentation we will describe our experience developing with a highly dyna...

Presentation to the LITA Forum 7th November 2014 Albuquerque, NM

Un recorrido por los cambios que nos generará el wearabletech en el futuro

Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...

Microsoft finally joins the smartwatch and fitness tracker game by introducing the...

Related pages

Streaming Distributed Data Processing with Silk #deim2014

moguno - 『Streaming Distributed Data Processing with Silk #deim2014』へのコメント
Read more

xerial/silk · GitHub - GitHub · Build software better ...

silk - Streaming distributed data processing. silk - Streaming distributed data processing. Skip to content. ... xerial / silk. Code; Issues; Pull Requests ...
Read more

Silk - Streaming Cluster Computing | Silk Weaver

Silk: Streaming Cluster Computing. Silk is an ... A data analysis program in Silk uses distributed data set Silk ... Streaming Distributed Data Processing.
Read more

Software | xerial.org

JDBC driver for using SQLite ... A cluster computing platform for streaming distributed data processing pipelines. GitHub: https://github.com/xerial/silk;
Read more

Stream processing - Wikipedia, the free encyclopedia

Stream processing is a computer programming paradigm, ... Streaming algorithm; Data stream mining; ... distributed shared; MPP;
Read more