Big data pipelines

Published on February 25, 2014

Author: vivekganesan



How to build flexible and scalable Big Data pipelines

Building scalable, flexible data pipelines for Big Data
Vivek Aanand Ganesan

Agenda
§  Introduction
§  Dealing with Legacy
§  Data Lineage and Provenance
§  Data Lifecycle Management
§  Data Pipeline Engineering for fun and profit

Big Data Introduction
Current state of the Big Data landscape
§  Hadoop
  §  Solves for the three V’s: Volume, Velocity, and Variety
  §  Primarily batch processing for large data sets
  §  Hadoop2 YARN: distributed computing platform
§  Not only Hadoop!
  §  Real-time systems: Storm, Spark, Samza etc.
  §  Wide variety of NoSQL systems: Cassandra, Riak, etc.
§  Don’t forget Legacy!

Big Data Promise
Why is Big Data so hot?
§  This is what the Big Data vendors sell:
  §  Throw some data in
  §  Analyze it using map/reduce
  §  Visualize your analytics/Generate insights
  §  Do some Predictive Analytics or Recommendations
  §  Profit!!
  §  Rinse and Repeat!!!

Big Data Problems
Why is Big Data so hard?
§  Real-life environments are not that simple!
§  For instance, privacy and compliance issues
§  Extract, Transform and Load is non-trivial
§  Building reliable ingest across complex environments
§  Data Lifecycle Management is not mature yet

Legacy Data
Why are Legacy environments important for Big Data?
§  Outside of Silicon Valley:
  §  Companies have been around for a while
  §  Have lots of valuable legacy data
  §  Some of it in Mainframes
  §  Some of it in flat files
  §  Some of it in relational DBs

Mainframe Data
How would you handle Mainframe data?
§  The open source Hadoop ecosystem does not provide a way to import data from mainframes
§  Only a commercial solution is available as of today
§  Think about that for a second

More on Mainframes
Why worry about Mainframe data?
§  Mainframes still run important systems
§  Separates schema from data (kinda like Hadoop)
  §  COBOL Copybooks
§  Hadoop can offload legacy data processing
§  But, you must first get the data in!!!

Other Legacy issues
Random collection of issues in dealing with legacy data
§  Unknown or incorrigible schema
§  Invalid data
§  Inconsistent data
§  Missing data
§  Fuzzy data
§  Sparse data

Big Data ETL
What is the problem with it?
§  First of all, the name
§  The term Extract, Transform, Load was coined in the old days, when data sets were smaller
§  Inherent assumption that the Transformation will happen out of band
§  That assumption does not hold for Big Data!

ELT
Will ELT solve the problem?
§  Flip the transform and load steps
§  Get the data in and then transform it
§  This way the transform is not out of band
§  Leverage the power of the underlying Big Data platform to do the transform
§  Makes perfect sense … except when

Privacy and Security
Issues with the ELT approach for privacy and security
§  Loading raw data before transforming it poses privacy and security challenges
§  What if the raw data contains SSNs or credit card numbers?
§  What if it is only meant to be seen by a few?
§  Once you load, the data is now available

The solution
Deal with it during extraction (as best as you can)
§  Do a secure extract
§  Perform a security/privacy audit of the raw data and build in rules to mask/anonymize/scrub data during the extraction
§  Somewhat solves the security problem but complicates the Extract step
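
A secure extract can be sketched as a set of per-field rules applied to every row before it leaves the source system. The Python sketch below is illustrative only: the field names (`ssn`, `card_number`, `notes`), the hard-coded salt, and the rule functions are all hypothetical, and a real security/privacy audit would determine the actual rules.

```python
import hashlib


def mask_ssn(value):
    """Mask an SSN-like value, keeping only its last four digits."""
    digits = [c for c in value if c.isdigit()]
    return "***-**-" + "".join(digits[-4:])


def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a truncated, salted SHA-256 digest.

    The hard-coded salt is for demonstration only.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def scrub(value):
    """Drop the value entirely."""
    return None


# Hypothetical rule set: field name -> function applied during extraction.
RULES = {"ssn": mask_ssn, "card_number": pseudonymize, "notes": scrub}


def secure_extract(rows, rules=RULES):
    """Yield each row with the matching rule applied to every sensitive field."""
    for row in rows:
        yield {
            key: (rules[key](value) if key in rules and value is not None else value)
            for key, value in row.items()
        }
```

Fields without a rule pass through unchanged, which is exactly the weakness the next slide raises: the approach only protects what the audit already knew about.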

Some exceptions
What if you don’t know which parts of the data set need to be protected?
§  Secure extract assumes that the data schema is known and the privacy levels are known
§  Not a valid assumption at all times
§  For example, what if the legacy data set has Facebook profile data from before the new privacy rules went into effect?

Data Lineage and Provenance
What is data lineage and data provenance?
Data Lineage
Data Lineage records the origin of the data set. This includes the time, place, original format and privacy/security information.
Data Provenance
Data Provenance records all the change history to the data set. This includes timestamp, change agent, purpose, process and edit log.
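
The two definitions map naturally onto two record types: a write-once lineage record and an append-only list of provenance entries. The sketch below is one possible shape, not an established schema; the class and field names are assumptions chosen to mirror the attributes listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Lineage:
    """Origin of a data set; frozen because origin facts should never change."""
    acquired_at: datetime
    source: str            # the "place" the data came from
    original_format: str
    privacy_level: str


@dataclass(frozen=True)
class ProvenanceEntry:
    """One change applied to the data set."""
    timestamp: datetime
    agent: str
    purpose: str
    process: str
    edit_log: str


@dataclass
class DataSetRecord:
    """A data set name with its lineage and its growing change history."""
    name: str
    lineage: Lineage
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def record_change(self, agent, purpose, process, edit_log):
        """Append a provenance entry stamped with the current UTC time."""
        entry = ProvenanceEntry(
            datetime.now(timezone.utc), agent, purpose, process, edit_log
        )
        self.provenance.append(entry)
        return entry
```

Keeping lineage immutable while provenance only grows reflects the distinction drawn above: the origin is fixed at acquisition, while the change history accumulates for as long as the data set lives.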

Data Lineage
Why is this a big deal?
§  Let’s go back to the Facebook problem
§  The solution is to record lineage information
§  This protects the consumer of the data set – assures that the data was available for use as of the point in time of origin
§  Protect yourself from lawsuits and fines!

Metadata
Data about the data
§  Astute observation: Metadata extraction is an integral part of managing data and implementing data lineage and data provenance
§  It can be rule-based, but more automated systems are increasingly desirable

Data Provenance
Why is it important?
§  Data Lineage solves one piece of the puzzle – namely, origin and metadata
§  What if data is changed during or after the extract step?
§  For purposes of audit and traceability, this must be recorded!

Data Provenance Approach
How to implement data provenance?
§  Can be workflow-based or dataflow-based
§  Workflow-based is much easier
  §  Records the changes as part of the workflow
§  Dataflow-based is much harder
  §  Needs to record each and every access to the data
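
The workflow-based approach can be sketched as a small runner that applies each step and logs one provenance entry per step. This is a minimal illustration under stated assumptions: `run_workflow`, the `(name, purpose, transform)` step shape, and the log fields are hypothetical and not taken from any existing tool.

```python
from datetime import datetime, timezone


def run_workflow(rows, steps, agent, provenance_log):
    """Apply each step in order and log one provenance entry per step.

    Each step is a (name, purpose, transform) tuple. A transform takes one
    row and returns the changed row, or None to drop the row.
    """
    for name, purpose, transform in steps:
        rows_in = len(rows)
        transformed = (transform(row) for row in rows)
        rows = [row for row in transformed if row is not None]
        provenance_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "process": name,
            "purpose": purpose,
            "rows_in": rows_in,
            "rows_out": len(rows),
        })
    return rows
```

The log records what each workflow step did, but not who read the data afterwards. Capturing that is the dataflow-based problem, which needs hooks at every access point rather than only in the pipeline code.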

Current Toolset
What exists currently in the open source big data ecosystem?
§  Nothing really to help with any of this
§  There are commercial products
§  But, no open source tools yet (or at least none that are in production use that I am aware of)
§  Would be a great idea to build one!

Data Lifecycle Management
Dealing with data throughout its lifecycle
§  Management of data from ingest to sunset
§  It involves dealing with all of the associated metadata, lineage and provenance artifacts
§  It also involves moving data around (large datasets in the Big Data world)
§  That is a data pipeline problem!

Data Lifecycle Management Tools
What exists in the Big Data ecosystem to handle this?
§  Current toolset is pretty limited
§  Apache Falcon (Hadoop sub-project) is a step in this direction but still not widely available for production use
§  It is possible to roll your own
§  But, it is a significant engineering effort

Modern Data Lifecycle Management
Modern data architecture needs modern data lifecycle management
§  Modern data architecture involves more than just Hadoop
§  Queuing systems – e.g. Kafka
§  Stream processing – e.g. Storm
§  Real-time systems – e.g. Spark
§  NoSQL systems – e.g. HBase
§  Integration with MPP systems

Data Lifecycle Management done right
Data Lifecycle Management across the Big Data Environment
§  Dealing with the various systems in the Big Data Landscape
§  Ability to set up schedules and periodic runs
§  Also, provide on-demand data processing
§  Treat data as an asset – apply asset management practices

Data Pipelining for fun and profit
Dealing with data pipelines as a distinct role in the Big Data Engineering world
§  Data Pipeline Engineering is a legitimate role in the Big Data environment
§  The complexity and all of the attendant issues make it a specialty in its own right!
§  It is much more than just ETL
§  Security, Lineage, Provenance and Lifecycle Management are all essential

So you want to be a data pipeline engineer?
What are the tools of the trade and ninja skills to master?
§  Languages
  §  Python
  §  Java
  §  Scala
  §  Pig
§  Systems
  §  Sqoop
  §  Storm
  §  Flume
  §  Hive/HBase

Can you make this easier?
I just want to write some code and be done with it
§  Pick your language
  §  Java
  §  Scala
  §  Clojure
§  Cascading
  §  Data Pipeline framework
  §  Full-featured
§  But, wait!
§  What about all the other stuff?

Integration and Extensions
Integrate with your favorite tools and extend when needed
§  Start with a solid pipeline framework like Cascading (or its offshoots like Scalding or Cascalog)
§  Integrate with either commercial or open source tools for specific functionality needed
§  Look at Cascading extensions: Lingual, Driven and Load

Build your own extensions
Extend Cascading with your own requirements
§  A programming framework such as Cascading makes it much easier to extend to build custom data lineage, provenance and lifecycle management solutions
§  You can also integrate with Security and Privacy solutions
§  This is a flexible approach

Build for scale
Understanding scale for data pipelines
§  Scaling data pipelines is quite complex due to the multiple moving pieces
§  Pipeline is only as fast as the slowest piece
§  Hadoop scales -> proven
§  Flume scales -> proven
§  Kafka scales -> proven
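
The "slowest piece" rule is simple enough to state in code: end-to-end throughput is the minimum of the per-stage throughputs. The stage names and rates in the usage note are made-up figures for illustration, not benchmarks.

```python
def bottleneck(stage_rates):
    """Return (stage, rate) for the slowest stage.

    stage_rates maps a stage name to its sustained throughput, with every
    stage measured in the same unit (for example, records per second).
    The slowest stage's rate is the pipeline's end-to-end ceiling.
    """
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]
```

For example, with hypothetical rates of `{"flume": 50000, "kafka": 80000, "sqoop": 4000, "hadoop": 30000}`, the function returns `("sqoop", 4000)`: adding Kafka or Hadoop capacity would not change the end-to-end figure until the Sqoop stage is addressed.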

Scaling Sqoop
Scaling relational DB load
§  What about Sqoop?
§  Not as easy or straightforward to scale
§  Start slow and incrementally increase load
§  Watch for network statistics and optimize
§  Load aggregates if that is all you need
§  Parallelize as much as possible without killing the DB
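
"Start slow and incrementally increase load" can be sketched as a ramp-up loop that doubles parallelism until the database shows strain or throughput stops improving. Everything here is an assumption for illustration: `run_import` is a hypothetical callback that runs one trial import at a given parallelism (for Sqoop, that could mean setting `--num-mappers`) and returns the observed rows per second and database latency; the thresholds are arbitrary defaults.

```python
def ramp_up(run_import, start=1, max_parallelism=32,
            max_db_latency_ms=200.0, min_gain=1.10):
    """Find a parallelism level by doubling until a limit is hit.

    run_import(parallelism) must return (rows_per_sec, db_latency_ms).
    The ramp stops when database latency exceeds max_db_latency_ms, or
    when a step improves throughput by less than the min_gain factor.
    Returns (parallelism, rows_per_sec) for the last accepted level, or
    (None, 0.0) if even the first trial exceeded the latency limit.
    """
    best_parallelism, best_rate = None, 0.0
    parallelism = start
    while parallelism <= max_parallelism:
        rate, latency_ms = run_import(parallelism)
        if latency_ms > max_db_latency_ms:
            break  # the database is struggling; keep the previous level
        if best_parallelism is not None and rate < best_rate * min_gain:
            break  # diminishing returns; more parallelism is not paying off
        best_parallelism, best_rate = parallelism, rate
        parallelism *= 2
    return best_parallelism, best_rate
```

The latency check comes first on purpose: a level that hurts the source database is rejected even if it delivers more rows per second.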

Scaling Storm
§  A lot of these systems depend on ZooKeeper (ZK)
§  Storm also relies on ZeroMQ (this is changing)
§  Provision for average load (not peak load)
§  Benchmark with typical event size (compress for larger events)
§  Storm on YARN will solve many issues
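
The "compress for larger events" tip can be sketched with a size threshold and a one-byte tag so that consumers know how to decode each frame. This is independent of Storm's own serialization; the function names, the tag bytes, and the 1024-byte threshold are assumptions for illustration.

```python
import zlib

RAW_TAG = b"R"         # payload stored as-is
COMPRESSED_TAG = b"Z"  # payload stored zlib-compressed


def encode_event(payload, threshold=1024):
    """Frame an event, compressing only payloads larger than threshold bytes."""
    if len(payload) > threshold:
        return COMPRESSED_TAG + zlib.compress(payload)
    return RAW_TAG + payload


def decode_event(frame):
    """Reverse encode_event using the one-byte tag at the start of the frame."""
    tag, body = frame[:1], frame[1:]
    if tag == COMPRESSED_TAG:
        return zlib.decompress(body)
    return body
```

Typical-size events skip compression and its CPU cost, so a benchmark built on typical events stays representative, while the occasional large event is shrunk before it hits the wire.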

Scaling Tips
§  Measure end-to-end throughput
§  Benchmark and fine-tune the best performing parts of the pipeline first
§  Scale the slower parts next – increase incrementally
§  Batch the slower parts – aggregate if you can and parallelize as much as possible

Summary
§  Pipeline Engineering will be one of the most challenging areas in Big Data with several big issues remaining to be solved
§  Expect plenty of innovation and action in this space
§  It is a great place to start a Big Data career

Thank You
Questions? Comments?
Please contact the author with your questions and/or comments.
