advertisement

Real-Time Data Loading from MySQL to Hadoop

50 %
50 %
advertisement
Information about Real-Time Data Loading from MySQL to Hadoop
Technology

Published on February 27, 2014

Author: Continuent_Tungsten

Source: slideshare.net

Description

Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have been rather limited. Continuent Tungsten Replicator provides enterprise-quality replication from MySQL to Hadoop under a GPL V2 license. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on source systems as well as Hadoop itself.

This webinar covers the following topics:

- How Hadoop works and why it's useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Transforming MySQL data within Hadoop to enable efficient analytics
- Tuning replication to maximize performance.

You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten. The software we are discussing is 100% open source and available from the Tungsten Replicator website at code.google.com.
advertisement

Real-Time Loading from MySQL to Hadoop Featuring Continuent Tungsten Robert Hodges, CEO ©Continuent 2014

Introducing Continuent ©Continuent 2014 2

Introducing Continuent • The leading provider of clustering and replication for open source DBMS • Our Product: Continuent Tungsten • Clustering - Commercial-grade HA, performance scaling and data management for MySQL • Replication - Flexible, high-performance data movement ©Continuent 2014 3

Quick Continuent Facts • Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data • Tungsten Replicator was application of the year at the 2011 MySQL User Conference • Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now • MySQL to Hadoop deployments are now in progress with multiple customers ©Continuent 2014 4

Selected Continuent Customers 23 ©Continuent 2014 5

Five Minute Hadoop Introduction ©Continuent 2014 6

What Is Hadoop, Exactly? a.A distributed file system b.A method of processing massive quantities of data in parallel c.The Cutting family’s stuffed elephant d.All of the above ©Continuent 2014 7

Hadoop Distributed File System hadoop command Find file NameNode (directory) Java Client Hive Read block(s) Pig DataNodes (replicated data) ©Continuent 2014 8

Map/Reduce Acme,2013,4.75! Spitze,2013,25.00! Acme,2013,55.25! Excelsior,2013,1.00! Spitze,2013,5.00 Acme,60.00! Excelsior,1.00! Spitze,30.00 MAP REDUCE Spitze,2014,60.00! Spitze,2014,9.50! Acme,2014,1.00! Acme,2014,4.00! Excelsior,2014,1.00! Excelsior,2014,9.00 ©Continuent 2014 Acme,5.00! Excelsior,10.00! Spitze,69.50 MAP 9 Acme,65.00! Excelsior,11.00! Spitze,99.50

Typical MySQL to Hadoop Use Case Hive (Analytics) Initial Load? Changes? Materialized views? App changes? Transaction Processing ©Continuent 2014 App load? Latency? 10 Hadoop Cluster

Options for Loading Data Manual Loading Sqoop CSV Files ©Continuent 2014 Tungsten Replicator Sqoop 11

Comparing Methods in Detail Manual via CSV Process Incremental Loading Latency Sqoop Tungsten Replicator Manual/ Scripted Manual/ Scripted Fully automated Possible with Requires DDL DDL changes changes Full-load Intermittent Fully supported Real-time Extraction Full and partial Low-impact Full table scan table scans binlog scan Requirements ©Continuent 2014 12

Replicating MySQL Data to Hadoop using Tungsten Replicator ©Continuent 2014 13

What is Tungsten Replicator? A real-time, high-performance, open source database replication engine ! GPL V2 license - 100% open source Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent “Golden Gate® without the Price Tag” ©Continuent 2014 14

Tungsten Replicator Overview Master Replicator Extract transactions from log DBMS Logs (Transactions + Metadata) Slave Replicator THL (Transactions + Metadata) Apply ©Continuent 2014 THL 15

Tungsten Replicator 3.0 & Hadoop • • Extract from MySQL or Oracle • • • • • Provision using Sqoop or parallel extraction ©Continuent 2014 Base Hadoop plus commercial distributions: Cloudera and HortonWorks Automatic replication of incremental changes Transformation to preferred HDFS formats Schema generation for Hive Tools for generating materialized views 16

Basic MySQL to Hadoop Replication Access via Hive MySQL binlog_format=row Tungsten Slave Replicator hadoop MySQL Binlog Tungsten Master Replicator hadoop Master-Side Filtering * pkey - Fill in pkey info * colnames - Fill in names * cdc - Add update type and schema/table info * source - Add source DBMS * replicate - Subset tables to be replicated Load raw CSV to HDFS (e.g., via LOAD DATA to Hive) Extract from MySQL binlog ©Continuent 2014 CSV CSV CSV CSV Files CSV Files Files Files Files 17 Hadoop Cluster

Hadoop Data Loading - Gory Details (Generate Table Definitions) Replicator Transactions from master Base Tables Base Tables Materialized Views hadoop Write data to CSV CSV CSV CSV Files Files Files Javascript load script e.g. hadoop.js ©Continuent 2014 Staging Staging Staging Tables Tables “Tables” Load using hadoop command (Generate Table Definitions) 18 (Run Map/ Reduce)

Demo #1 ! Replicating sysbench data ©Continuent 2014 19

Viewing MySQL Data in Hadoop ©Continuent 2014 20

Generating Staging Table Schema $ ddlscan -template ddl-mysql-hive-0.10-staging.vm ! -user tungsten -pass secret ! -url jdbc:mysql:thin://logos1:3306/db01 -db db01! ...! DROP TABLE IF EXISTS db01.stage_xxx_sbtest;! ! CREATE EXTERNAL TABLE db01.stage_xxx_sbtest! (! tungsten_opcode STRING ,! tungsten_seqno INT ,! tungsten_row_id INT ,! id INT ,! k INT ,! c STRING ,! pad STRING)! ROW FORMAT DELIMITED FIELDS TERMINATED BY '001' ESCAPED BY ''! LINES TERMINATED BY 'n'! STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest'; ©Continuent 2014 21

Generating Base Table Schema $ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten ! -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01! ...! DROP TABLE IF EXISTS db01.sbtest;! ! CREATE TABLE db01.sbtest! (! id INT ,! k INT ,! c STRING ,! pad STRING )! ;! ©Continuent 2014 22

Creating a Materialized View in Theory Log #1 Log #2 ... Log #N MAP Sort by key(s), transaction order REDUCE Emit last row per key if not a delete ©Continuent 2014 23

Creating a Materialized View in Hive $ hive! ...! hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/ tungsten-reduce;! hive> FROM ( ! SELECT sbx.*! FROM db01.stage_xxx_sbtest sbx! DISTRIBUTE BY id ! SORT BY id,tungsten_seqno,tungsten_row_id! ) map1! INSERT OVERWRITE TABLE db01.sbtest! SELECT TRANSFORM(! tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad)! USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'! AS id INT,k INT,c STRING,pad STRING;! ... MAP REDUCE ©Continuent 2014 24

Comparing MySQL and Hadoop Data $ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib! ...! $ /opt/continuent/tungsten/bristlecone/bin/dc ! -url1 jdbc:mysql:thin://logos1:3306/db01 ! -user1 tungsten -password1 secret ! -url2 jdbc:hive2://localhost:10000 ! -user2 'tungsten' -password2 'secret' -schema db01 ! -table sbtest -verbose -keys id ! -driver org.apache.hive.jdbc.HiveDriver! 22:33:08,093 INFO DC - Data comparison utility! ...! 22:33:24,526 INFO Tables compare OK! ©Continuent 2014 25

Doing it all at once $ git clone ! https://github.com/continuent/continuent-toolshadoop.git! ! $ cd continuent-tools-hadoop! ! $ bin/load-reduce-check ! -U jdbc:mysql:thin://logos1:3306/db01 ! -s db01 --verbose ©Continuent 2014 26

Demo #2 ! Constructing and Checking a Materialized View ©Continuent 2014 27

Scaling It Up! ©Continuent 2014 28

MySQL to Hadoop Fan-In Architecture Masters Slaves Replicator m1 (master) RBR Replicator Replicator m1 (slave) m2 (master) m2 (slave) m3 (slave) RBR Replicator m3 (master) RBR ©Continuent 2014 29 Hadoop Cluster (many nodes)

Integration with Provisioning MySQL Access via Hive (Initial provisioning run) Sqoop/ETL Tungsten Master MySQL Binlog Tungsten Slave hadoop hadoop binlog_format=row ©Continuent 2014 CSV CSV CSV CSV Files CSV Files Files Files Files 30 Hadoop Cluster

On-Demand Provisioning via Parallel Extract Access via Hive MySQL binlog_format=row Tungsten Slave Replicator hadoop MySQL Binlog Tungsten Master Replicator hadoop Master-Side Filtering * pkey - Fill in pkey info * colnames - Fill in names * cdc - Add update type and schema/table info * source - Add source DBMS * replicate - Subset tables to be replicated (other filters as needed) Extract from MySQL tables ©Continuent 2014 CSV CSV CSV CSV Files CSV Files Files Files Files Load raw CSV to HDFS (e.g., via LOAD DATA to Hive) 31 Hadoop Cluster

Tungsten Replicator Roadmap • • • Parallel CSV file loading • • Replication out of Hadoop ©Continuent 2014 Partition loaded data by commit time Data formats and tools to support additional Hadoop clients as well as HBase Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger,...) 32

Getting Started with Continuent Tungsten ©Continuent 2014 33

Where Is Everything? • Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/ • Replicator 3.0 documentation is available on Continuent website http://docs.continuent.com/tungsten-replicator-3.0/ deployment-hadoop.html • Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop Contact Continuent for support ©Continuent 2014 34

Commercial Terms • • • Replicator features are open source (GPL V2) Investment Elements • • • POC / Development (Walk Away Option) Production Deployment Annual Support Subscription Governing Principles • • ©Continuent 2014 Annual Subscription Required More Upfront Investment -> Less Annual Subscription 35

We Do Clustering Too! GonzoPortal.com Tungsten clusters combine offthe-shelf open source MySQL servers into data services with: apache /php ! • 24x7 data access • Scaling of load on replicas • Simple management commands ! ...without app changes or data migration Amazon US West ©Continuent 2014 36 Connector Connector

In Conclusion: Tungsten Offers... • Fully automated, real-time replication from MySQL into Hadoop • Support for automatic transformation to HDFS data formats and creation of full materialized views • Positions users to take advantage of evolving realtime features in Hadoop ©Continuent 2014 37

560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: sales@continuent.com Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://www.continuent.com/news/blogs Continuent Web Page: http://www.continuent.com ! Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator ©Continuent 2014

Add a comment

Related presentations

Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...

In this presentation we will describe our experience developing with a highly dyna...

Presentation to the LITA Forum 7th November 2014 Albuquerque, NM

Un recorrido por los cambios que nos generará el wearabletech en el futuro

Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...

Microsoft finally joins the smartwatch and fitness tracker game by introducing the...

Related pages

Real-Time Data Loading from MySQL to Hadoop - YouTube

Real-time Data Loading from Oracle and MySQL to Data Warehouses, ... Set Up & Operate Real-Time Data Loading into Hadoop - Duration: 59:06.
Read more

MySQL :: MySQL Applier for Hadoop

Enabling Real-Time MySQL to HDFS Integration. Big Data is ... loading to and from tables can ... of events between MySQL and Hadoop. MySQL Applier for Hadoop.
Read more

Real-Time Data Loading from MySQL to Hadoop using Tungsten ...

To follow-up and describe some of the methods and techniques behind replicating into Hadoop from MySQL in real-time, and how this can be combined into your ...
Read more

MySQL Applier for Hadoop: Video Tutorial - YouTube

MySQL Applier for Hadoop: Video Tutorial ... Real-Time Data Loading from MySQL to Hadoop - Duration: 57:23. VMware Continuent 4,178 views. 57:23
Read more

“Real-time Data Loading From MySQL to Hadoop" Webinar ...

... today announced that its highly attended Webinar “Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0” is now available ...
Read more

Archival and Analytics - Importing MySQL data into Hadoop ...

Archival and Analytics - Importing MySQL data into Hadoop ... them into a file in HDFS in real-time ... Loading data to table ...
Read more

Tungsten University: Real-time data loading from MySQL to ...

Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Up until now mechanisms for moving data between MySQL and Hadoop have ...
Read more

MySQL :: Unlocking New Big Data Insights with MySQL

Unlocking New Big Data Insights with MySQL. ... MySQL is a key component of numerous big data ... valuable insights using MySQL with the Hadoop ...
Read more