Published on February 25, 2014
Building scalable, flexible data pipelines for Big Data
Vivek Aanand Ganesan
email@example.com
Agenda
§ Introduction
§ Dealing with Legacy
§ Data Lineage and Provenance
§ Data Lifecycle Management
§ Data Pipeline Engineering for fun and profit
Big Data Introduction
Current state of the Big Data landscape
§ Hadoop
§ Solves for the three V's: Volume, Velocity, and Variety
§ Primarily batch processing for large data sets
§ Hadoop 2 YARN: distributed computing platform
§ Not only Hadoop!
§ Real-time systems: Storm, Spark, Samza, etc.
§ Wide variety of NoSQL systems: Cassandra, Riak, etc.
§ Don't forget Legacy!
Big Data Promise
Why is Big Data so hot?
§ This is what the Big Data vendors sell:
§ Throw some data in
§ Analyze it using map/reduce
§ Visualize your analytics / generate insights
§ Do some predictive analytics or recommendations
§ Profit!!
§ Rinse and repeat!!!
Big Data Problems
Why is Big Data so hard?
§ Real-life environments are not that simple!
§ For instance, privacy and compliance issues
§ Extract, Transform, and Load is non-trivial
§ Building reliable ingest across complex environments
§ Data Lifecycle Management is not mature yet
Legacy Data
Why are legacy environments important for Big Data?
§ Outside of Silicon Valley:
§ Companies have been around for a while
§ They have lots of valuable legacy data
§ Some of it in mainframes
§ Some of it in flat files
§ Some of it in relational DBs
Mainframe Data
How would you handle mainframe data?
§ The open source Hadoop ecosystem does not provide a way to import data from mainframes
§ Only a commercial solution is available as of today
§ Think about that for a second
More on Mainframes
Why worry about mainframe data?
§ Mainframes still run important systems
§ They separate schema from data (kind of like Hadoop)
§ COBOL copybooks
§ Hadoop can offload legacy data processing
§ But you must first get the data in!!!
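To make the copybook point concrete: a COBOL copybook is the schema that tells you how to slice a fixed-width mainframe record. Here is a minimal sketch in Python; the layout (field names, offsets, widths) is purely illustrative, and a real import would also have to handle EBCDIC encoding, packed decimals (COMP-3), and nested/repeating groups.

```python
# Sketch: decoding fixed-width records using a layout derived from a
# (hypothetical) COBOL copybook. Real copybooks also involve EBCDIC,
# COMP-3 packed decimals, REDEFINES, etc. -- this only shows the idea
# of schema kept separate from data.
LAYOUT = [              # (field name, start offset, width) -- illustrative
    ("cust_id", 0, 6),
    ("name",    6, 20),
    ("balance", 26, 9),
]

def parse_record(line: str) -> dict:
    rec = {name: line[start:start + width].strip()
           for name, start, width in LAYOUT}
    rec["balance"] = int(rec["balance"])   # zoned digits in this toy layout
    return rec

row = parse_record("000042Jane Doe            000001995")
```

The same record bytes can be reinterpreted under a different layout without touching the data, which is exactly the schema/data separation the slide compares to Hadoop's schema-on-read.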
Other Legacy Issues
A random collection of issues in dealing with legacy data
§ Unknown or incorrigible schema
§ Invalid data
§ Inconsistent data
§ Missing data
§ Fuzzy data
§ Sparse data
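Invalid, inconsistent, and missing values are usually caught with per-field validation rules at ingest time. A minimal sketch, with hypothetical field names and rules (none of these come from the deck):

```python
import re

# Sketch: per-field validation rules applied at ingest. A missing field
# counts as failing its rule. Field names and rules are illustrative.
RULES = {
    "zip":   lambda v: bool(re.fullmatch(r"\d{5}", v or "")),
    "email": lambda v: v is not None and "@" in v,
    "age":   lambda v: v is not None and 0 <= v <= 130,
}

def validate(record: dict) -> list:
    """Return the names of fields that fail their rule."""
    return [f for f, ok in RULES.items() if not ok(record.get(f))]

bad = validate({"zip": "9410", "email": "a@b.com"})  # age missing, zip short
```

In practice the failing records are routed to a quarantine area rather than dropped, so that fuzzy or sparse data can still be repaired later.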
Big Data ETL
What is the problem with it?
§ First of all, the name
§ Extract, Transform, Load was coined in the old days, when data sets were smaller
§ Inherent assumption that the transformation will happen out of band
§ That assumption does not hold for Big Data!
ELT
Will ELT solve the problem?
§ Flip the transform and load steps
§ Get the data in and then transform it
§ This way the transform is not out of band
§ Leverage the power of the underlying Big Data platform to do the transform
§ Makes perfect sense ... except when
Privacy and Security
Issues with the ELT approach for privacy and security
§ Loading raw data before transforming it poses privacy and security challenges
§ What if the raw data contains SSNs or credit card numbers?
§ What if it is only meant to be seen by a few?
§ Once you load it, the data is available
The Solution
Deal with it during extraction (as best as you can)
§ Do a secure extract
§ Perform a security/privacy audit of the raw data and build in rules to mask/anonymize/scrub data during the extraction
§ Somewhat solves the security problem, but complicates the Extract step
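A secure extract rule can be as simple as a pattern match plus a one-way token. A minimal sketch, assuming the sensitive values follow a known pattern (the regex, salt, and token format here are illustrative choices, not a prescribed scheme):

```python
import hashlib
import re

# Sketch of a "secure extract" masking rule: replace SSN-shaped values
# with a salted one-way token before raw data lands in the cluster.
# The pattern, salt, and token format are illustrative.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask(value: str, salt: str = "demo-salt") -> str:
    # A stable hash token keeps joins on the field possible while the
    # raw number never leaves the extract step.
    return SSN.sub(
        lambda m: "SSN-" + hashlib.sha256((salt + m.group()).encode()).hexdigest()[:8],
        value,
    )

clean = mask("customer 123-45-6789 called support")
```

Hashing rather than redacting is a deliberate trade-off: downstream jobs can still group and join on the masked field, which plain scrubbing would break.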
Some Exceptions
What if you don't know which parts of the data set need to be protected?
§ Secure extract assumes that the data schema and the privacy levels are known
§ Not a valid assumption at all times
§ For example, what if the legacy data set contains Facebook profile data from before the new privacy rules went into effect?
Data Lineage and Provenance
What are data lineage and data provenance?
Data Lineage: records the origin of the data set. This includes the time, place, original format, and privacy/security information.
Data Provenance: records all change history to the data set. This includes timestamp, change agent, purpose, process, and edit log.
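The distinction is easy to capture in two record types: lineage is written once at ingest, provenance is appended on every change. A minimal sketch; the field names follow the slide's wording, but the shapes themselves are an illustrative assumption:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Sketch of the two record types the slide distinguishes.
@dataclass
class Lineage:                 # origin: recorded once, at ingest
    source: str
    ingested_at: datetime
    original_format: str
    privacy_level: str

@dataclass
class ProvenanceEvent:         # change history: appended on every edit
    timestamp: datetime
    agent: str
    purpose: str
    process: str

@dataclass
class DatasetRecord:
    lineage: Lineage
    provenance: list = field(default_factory=list)

rec = DatasetRecord(
    Lineage("mainframe-export", datetime.now(timezone.utc), "EBCDIC", "restricted")
)
rec.provenance.append(
    ProvenanceEvent(datetime.now(timezone.utc), "etl-job-7",
                    "mask SSNs", "secure-extract")
)
```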
Data Lineage
Why is this a big deal?
§ Let's go back to the Facebook problem
§ The solution is to record lineage information
§ This protects the consumer of the data set – it assures that the data was available for use as of the point and time of origin
§ Protect yourself from lawsuits and fines!
Metadata
Data about the data
§ Astute observation: metadata extraction is an integral part of managing data and implementing data lineage and provenance
§ It can be rule-based, but increasingly automated systems are desirable
Data Provenance
Why is it important?
§ Data lineage solves one piece of the puzzle – namely, origin and metadata
§ What if data is changed during or after the extract step?
§ For purposes of audit and traceability, this must be recorded!
Data Provenance Approach
How to implement data provenance?
§ Can be workflow-based or dataflow-based
§ Workflow-based is much easier
§ Records the changes as part of the workflow
§ Dataflow-based is much harder
§ Needs to record each and every access to the data
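The workflow-based option can be sketched as a decorator that logs an audit event each time a pipeline step runs – changes are recorded at the step level, not on every individual data access. The step name, purpose string, and in-memory log are all illustrative:

```python
import functools
from datetime import datetime, timezone

AUDIT_LOG = []   # stand-in for a durable provenance store

def provenance(purpose: str):
    """Workflow-based provenance: each pipeline step records that it ran,
    by whom (the function), for what purpose, and when."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(data):
            out = fn(data)
            AUDIT_LOG.append({
                "step": fn.__name__,
                "purpose": purpose,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return out
        return inner
    return wrap

@provenance("strip whitespace from all fields")
def clean_fields(rows):
    return [{k: v.strip() for k, v in r.items()} for r in rows]

result = clean_fields([{"name": " Ada "}])
```

Dataflow-based provenance would instead have to intercept every read and write of the data itself, which is why the slide calls it much harder.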
Current Toolset
What exists currently in the open source Big Data ecosystem?
§ Nothing, really, to help with any of this
§ There are commercial products
§ But no open source tools yet (at least none in production use that I am aware of)
§ It would be a great idea to build one!
Data Lifecycle Management
Dealing with data throughout its lifecycle
§ Management of data from ingest to sunset
§ It involves dealing with all of the associated metadata, lineage, and provenance artifacts
§ It also involves moving data around (large data sets, in the Big Data world)
§ That is a data pipeline problem!
Data Lifecycle Management Tools
What exists in the Big Data ecosystem to handle this?
§ The current toolset is pretty limited
§ Apache Falcon (a Hadoop sub-project) is a step in this direction, but still not widely available for production use
§ It is possible to roll your own
§ But it is a significant engineering effort
Modern Data Lifecycle Management
Modern data architecture needs modern data lifecycle management
§ Modern data architecture involves more than just Hadoop:
§ Queuing systems – e.g., Kafka
§ Stream processing – e.g., Storm
§ Real-time systems – e.g., Spark
§ NoSQL systems – e.g., HBase
§ Integration with MPP systems
Data Lifecycle Management Done Right
Data lifecycle management across the Big Data environment
§ Dealing with the various systems in the Big Data landscape
§ Ability to set up schedules and periodic runs
§ Also, provide on-demand data processing
§ Treat data as an asset – apply asset management practices
Data Pipelining for Fun and Profit
Dealing with data pipelines as a distinct role in Big Data engineering
§ Data pipeline engineering is a legitimate role in the Big Data environment
§ The complexity and all of the attendant issues make it a specialty in its own right!
§ It is much more than just ETL
§ Security, lineage, provenance, and lifecycle management are all essential
So You Want to Be a Data Pipeline Engineer?
What are the tools of the trade and ninja skills to master?
§ Languages: Python, Java, Scala
§ Systems: Sqoop, Storm, Pig, Flume, Hive/HBase
Can You Make This Easier?
I just want to write some code and be done with it
§ Pick your language: Java, Scala, Clojure
§ Cascading: a full-featured data pipeline framework
§ But, wait! What about all the other stuff?
Integration and Extensions
Integrate with your favorite tools and extend when needed
§ Start with a solid pipeline framework like Cascading (or its offshoots, Scalding and Cascalog)
§ Integrate with commercial or open source tools for specific functionality as needed
§ Look at Cascading extensions: Lingual, Driven, and Load
Build Your Own Extensions
Extend Cascading with your own requirements
§ A programming framework such as Cascading makes it much easier to build custom data lineage, provenance, and lifecycle management solutions
§ You can also integrate with security and privacy solutions
§ This is a flexible approach
Build for Scale
Understanding scale for data pipelines
§ Scaling data pipelines is quite complex due to the multiple moving pieces
§ A pipeline is only as fast as its slowest piece
§ Hadoop scales -> proven
§ Flume scales -> proven
§ Kafka scales -> proven
Scaling Sqoop
Scaling relational DB loads
§ What about Sqoop?
§ Not as easy or straightforward to scale
§ Start slow and incrementally increase the load
§ Watch network statistics and optimize
§ Load aggregates if that is all you need
§ Parallelize as much as possible without killing the DB
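The "parallelize without killing the DB" advice boils down to splitting the extract into key ranges (what Sqoop does with its split column) and capping how many readers hit the source at once. A self-contained sketch using SQLite as a stand-in for the legacy RDBMS; the table, split count, and worker cap are illustrative:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Sketch: range-partitioned parallel extract with a cap on concurrent
# readers. SQLite stands in for the source DB; table and numbers are
# illustrative.
DB = "legacy_demo.db"

def setup():
    con = sqlite3.connect(DB)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(id INTEGER PRIMARY KEY, amount REAL)")
    con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?)",
                    [(i, i * 1.5) for i in range(1, 101)])
    con.commit()
    con.close()

def extract_range(lo, hi):
    con = sqlite3.connect(DB)
    rows = con.execute("SELECT id, amount FROM orders "
                       "WHERE id BETWEEN ? AND ?", (lo, hi)).fetchall()
    con.close()
    return rows

def parallel_extract(max_id, n_splits=4, max_workers=2):
    # max_workers is the throttle: start small, raise it only while the
    # source DB stays healthy.
    step = max_id // n_splits
    ranges = [(i * step + 1, (i + 1) * step if i < n_splits - 1 else max_id)
              for i in range(n_splits)]
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        chunks = ex.map(lambda r: extract_range(*r), ranges)
    return [row for chunk in chunks for row in chunk]

setup()
rows = parallel_extract(100)
```

With Sqoop itself the same knobs are the split column and the mapper count; the principle of ramping parallelism up incrementally is the same.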
Scaling Storm
§ Many of these systems depend on ZooKeeper
§ Storm also relies on ZeroMQ (this is changing)
§ Provision for average load (not peak load)
§ Benchmark with typical event size (compress larger events)
§ Storm on YARN will solve many issues
Scaling Tips
§ Measure end-to-end throughput
§ Benchmark and fine-tune the best-performing parts of the pipeline first
§ Scale the slower parts next – increase incrementally
§ Batch the slower parts – aggregate if you can, and parallelize as much as possible
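The "measure end-to-end throughput" and "batch the slower parts" tips can be demonstrated together: time the whole run, report records per second, and watch what batching does to a sink with a fixed per-call cost. The sink and its 1 ms round-trip cost are invented for the sketch:

```python
import time

# Sketch: end-to-end throughput measurement, and the effect of batching
# a slow sink. The 1 ms per-call cost is an illustrative stand-in for a
# network or DB round-trip.
def slow_sink(batch):
    time.sleep(0.001)   # fixed per-call overhead, regardless of batch size

def run_pipeline(records, batch_size):
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        slow_sink(records[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(records) / elapsed    # records per second, end to end

unbatched = run_pipeline(list(range(1000)), batch_size=1)
batched   = run_pipeline(list(range(1000)), batch_size=100)
```

Batching amortizes the fixed per-call cost over many records, which is why the batched run's throughput is far higher – the same reasoning behind aggregating before the slowest stage of a real pipeline.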
Summary
§ Pipeline engineering will be one of the most challenging areas in Big Data, with several big issues remaining to be solved
§ Expect plenty of innovation and action in this space
§ It is a great place to start a Big Data career
Thank You
Questions? Comments?
Please contact firstname.lastname@example.org with your questions and/or comments.