Hadoop distributions - ecosystem


Published on October 13, 2014

Author: jaksky

Source: slideshare.net


An overview of Hadoop distributions and their ecosystem, with a particular focus on the Hortonworks distribution.

1. Big Data Distributions and Ecosystem

2. BigData • BigData 3Vs – Volume – Velocity – Variety – (Veracity ~ Accuracy)

3. BigData ≈ Hadoop • Started from a Google white paper about Map-Reduce • Open Source • Apache Software Foundation • Current version 2.2.0 • Second Generation • Set of tools – Core Map-Reduce • http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F

4. Distributions

5. Distribution Comparison • Different approaches to near-real-time analytics (Hortonworks vs Cloudera) • Different approaches to cluster management • Different levels of “Open Source” – vendor lock-in • Proprietary components – MapR-FS (MapR) NOTE: The BigData space is evolving rapidly.

6. We use

7. Ecosystem

8. • Map-Reduce paradigm • HDFS – Hadoop Distributed File System • YARN – Yet Another Resource Negotiator – opens the cluster to non-MapReduce computational models • (Tez) – intended to bridge the gap between the batch and near-real-time operational models
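
The Map-Reduce paradigm named on this slide can be illustrated with a minimal single-process sketch. This is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel across nodes over data stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The point of the split is that each map call sees only one record and each reduce call sees only one key, which is what lets a framework distribute the work.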

9. PIG • Scripting language for Map-Reduce • Procedural language • Typical use cases: – standard extract-transform-load (ETL) data pipelines – research on raw data – iterative processing of data
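
To show what "procedural" means for an ETL pipeline, here is a small sketch in plain Python rather than in Pig's own scripting language. Each step produces a named intermediate result that the next step consumes, which is the style the slide attributes to Pig. The record layout (`user,status,bytes`) and all variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw input: "user,status,bytes" lines, as might sit in a log file.
raw_lines = [
    "alice,200,512",
    "bob,500,0",
    "alice,200,1024",
    "carol,200,256",
]

# Step 1 (extract): parse each line into a typed record.
records = [
    {"user": user, "status": int(status), "bytes": int(size)}
    for user, status, size in (line.split(",") for line in raw_lines)
]

# Step 2 (transform): keep only successful requests.
ok_records = [r for r in records if r["status"] == 200]

# Step 3 (transform): group by user and sum the transferred bytes.
bytes_per_user = defaultdict(int)
for r in ok_records:
    bytes_per_user[r["user"]] += r["bytes"]

# Step 4 (load): produce sorted output rows ready to be stored.
output_rows = sorted(bytes_per_user.items())
print(output_rows)
```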

10. HIVE • SQL-like syntax • Data warehouse – ad-hoc queries • Declarative language • On top of Map-Reduce
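
The declarative style can be sketched with an ad-hoc aggregate query. The example below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere; it is not Hive, and the `requests` table and its columns are assumptions. The query states what result is wanted and leaves the execution plan to the engine, which in Hive's case, per the slide, is Map-Reduce.

```python
import sqlite3

# In-memory database used only as a runnable stand-in for a SQL-like engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (username TEXT, status INTEGER, bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("alice", 200, 512),
        ("bob", 500, 0),
        ("alice", 200, 1024),
        ("carol", 200, 256),
    ],
)

# Declarative: describe the result, not the steps needed to compute it.
query = """
    SELECT username, SUM(bytes) AS total_bytes
    FROM requests
    WHERE status = 200
    GROUP BY username
    ORDER BY username
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```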

11. PIG x HIVE

12. HBase • Real-time access to data • Non-relational (NoSQL) database • Columnar database • Runs on top of HDFS • Fault tolerant, Flexible, Highly Available
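
A rough mental model for this kind of store is a sorted map from row key to columns, where each cell keeps timestamped versions. The toy class below is a sketch under that assumption only. `ToyTable` and its methods are hypothetical names, not the HBase client API, and nothing here is distributed or persisted.

```python
class ToyTable:
    """Toy in-memory model: row key -> column -> list of (timestamp, value)."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp):
        """Store a new version of one cell, keeping the newest version first."""
        versions = self._rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_key, stop_key):
        """Yield row keys in sorted order within [start_key, stop_key)."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key


table = ToyTable()
table.put("user#001", "profile:name", "Alice", timestamp=1)
table.put("user#001", "profile:name", "Alice B.", timestamp=2)
table.put("user#002", "profile:name", "Bob", timestamp=1)

latest = table.get("user#001", "profile:name")
keys_in_range = list(table.scan("user#001", "user#002"))
print(latest, keys_in_range)
```

Lookups by row key and scans over a key range are the access patterns this model favours, which is why row-key design matters so much in practice.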

13. Apache Storm • Distributed real-time computation system • Adds real-time data processing capabilities to Apache Hadoop • “Stream processing” • Runs on top of YARN • Usually part of “λ architecture”
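
Stream processing differs from batch in that each record is handled as it arrives and results are updated continuously. The sketch below mimics a tiny topology with Python generators. The terms spout (a source of tuples) and bolt (a processing step) come from Storm, but the functions themselves are hypothetical, run in one process, and do not use Storm's API.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emit one sentence at a time, as an unbounded source would."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into individual words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep running counts and emit an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(["storm on yarn", "storm topology"])))
updates = list(topology)
print(updates)
```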

14. Apache Mahout • Scalable machine learning for Hadoop • Based on Map-Reduce • Algorithms: – Collaborative filtering – Clustering – Classification – Frequent item mining
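
As an illustration of the last algorithm family in the list, frequent item mining, the sketch below counts how often pairs of items occur together and keeps the pairs that reach a minimum support. It is a naive single-machine counting approach chosen for brevity, not Mahout's implementation, and the function name, baskets and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical representation.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]
result = frequent_pairs(baskets, min_support=3)
print(result)
```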

15. Apache Flume • Streaming data into Hadoop • Collecting, Aggregating • Guaranteed data delivery • Scales horizontally

16. Apache Sqoop • Moves data between Hadoop and structured datastores – relational databases • Imports into HDFS, HBase or Hive

17. Apache ZooKeeper • Distributed configuration service • Synchronization service • Naming registry • Reliable, Simple, Ordered

18. Apache Ambari • Management console for a Hadoop cluster • Monitoring • In the ASF incubation phase

19. Apache Oozie • Workflow engine • DAG = Directed Acyclic Graph • Integrates with: – MapReduce – PIG – Hive – Sqoop
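
Modelling a workflow as a DAG means every action lists the actions it depends on, and the engine may run an action only after all of its dependencies have finished. The sketch below derives one valid execution order using Python's standard `graphlib` module (Python 3.9+). The action names are hypothetical and this is not how Oozie workflows are actually defined. It only illustrates the ordering constraint.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_report": {"pig_clean"},
    "mapreduce_index": {"pig_clean"},
    "sqoop_export": {"hive_report", "mapreduce_index"},
}


def execution_order(dag):
    """Return one order in which every action follows its dependencies.

    Raises CycleError if the graph contains a cycle and so is not a DAG.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Because `hive_report` and `mapreduce_index` do not depend on each other, either may come first, and a workflow engine would be free to run them in parallel.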

20. Typical use case

21. Apache Falcon • Simplifies data management and pipeline processing • Automates movement and processing of datasets • Data replication • Data eviction • Coordination and scheduling

22. Apache Knox • Authentication for Hadoop • Hadoop security • Designed to run in a DMZ environment • Hadoop cluster protected by a firewall
