Apache Spark

Published on June 12, 2016

Author: ramakrishnakapa

Source: slideshare.net

1. INTRODUCTION Apache Spark is an open source cluster computing system that aims to make data analytics fast: both fast to run and fast to write. It is a fast, in-memory data processing engine with expressive development APIs in Scala, Java, Python, and R that let data workers efficiently run machine learning algorithms requiring fast iterative access to datasets.

2. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing.

3. Write applications quickly in Java, Scala, Python, or R. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.
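As a rough illustration of how a few high-level operators compose into a parallel job, here is a minimal word-count sketch in PySpark. The helper names (`tokenize`, `word_count`, `run_demo`) and the sample sentences are hypothetical and not from the slides; the sketch assumes a local PySpark installation and uses only `parallelize`, `flatMap`, `map`, `reduceByKey`, and `collect`.

```python
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, lines):
    """Count words with a chain of RDD operators (sc is a SparkContext)."""
    return (sc.parallelize(lines)            # distribute the input lines
              .flatMap(tokenize)             # one record per word
              .map(lambda word: (word, 1))   # pair each word with a count of 1
              .reduceByKey(add)              # sum the counts per word
              .collect())                    # action: bring results to the driver


def run_demo():
    """Run the sketch locally; requires PySpark to be installed."""
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "wordcount-sketch")
    try:
        return dict(word_count(sc, ["Spark is fast", "Spark is general"]))
    finally:
        sc.stop()
```

Calling `run_demo()` on a machine with PySpark starts a local context, runs the job, and returns the counts as a dictionary. The same operator chain can be typed line by line into the Python shell.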

4. Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

5. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3. [Figure labels: Spark, HDFS, HBase, Hadoop, Spark SQL, Hive]

6. Spark uses a different data storage model, resilient distributed datasets (RDDs), which guarantee fault tolerance in a way that minimizes network I/O. Spark has become another data processing engine in the Hadoop ecosystem, which benefits businesses and the community alike because it adds capability to the Hadoop stack. Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. It does this by reducing the number of reads and writes to disk: intermediate processing data is kept in memory.
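The benefit of keeping intermediate data in memory shows up most clearly in iterative jobs that scan the same dataset many times. The following PySpark sketch is an assumption-laden illustration, not taken from the slides: the helper names, the sample values, and the toy gradient-descent loop (which moves `w` toward the mean of the data) are all hypothetical. The point is the `cache()` call, which asks Spark to keep the parsed values in memory so later iterations do not re-parse the input.

```python
from operator import add


def parse_value(line):
    """Convert one text record into a float."""
    return float(line.strip())


def gradient_step(w, grad_sum, n, lr=0.5):
    """One gradient-descent update: move w against the average gradient."""
    return w - lr * grad_sum / n


def run_demo(iterations=10):
    """Iterate over a cached RDD; requires PySpark to be installed."""
    from pyspark import SparkContext
    sc = SparkContext("local[*]", "cache-sketch")
    try:
        values = sc.parallelize(["1.0", "2.0", "3.0", "6.0"]).map(parse_value)
        values.cache()          # mark the parsed values to be kept in memory
        n = values.count()      # first action: computes and caches the data
        w = 0.0
        for _ in range(iterations):
            current = w         # value captured by the lambda below
            # Each pass reuses the cached values instead of re-parsing them.
            grad_sum = values.map(lambda x: current - x).reduce(add)
            w = gradient_step(w, grad_sum, n)
        return w
    finally:
        sc.stop()
```

Without the `cache()` call the job still produces the same answer, but each iteration would recompute `parse_value` over the input. On a large dataset read from disk, that repeated work is what in-memory storage avoids.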

7. Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data.

8. Iterative algorithms in machine learning. Interactive data mining and data processing. Data warehousing: Spark SQL is compatible with Apache Hive and can run queries up to 100x faster than Hive. Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis. Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are especially helpful because they are easy and fast to process.

9. Spark provides an interactive shell, a powerful tool for analyzing data interactively. It is available in either Scala or Python. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

10. RDD transformations return a pointer to a new RDD and let you create dependencies between RDDs. Each RDD in a dependency chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark is lazy, so nothing is executed until you call an action, which triggers job creation and execution; transformations on their own only record the dependency chain.
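The dependency chain and lazy execution described above can be modeled in a few lines of plain Python. This is a toy stand-in, not Spark's implementation or API: the class `LazyDataset`, the `load` function, and the `calls` log are hypothetical names introduced only for illustration. Each dataset holds a compute function and a pointer to its parent, and nothing runs until `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a compute function plus a parent pointer."""

    def __init__(self, compute, parent=None, name="source"):
        self.compute = compute  # function: parent's items -> this dataset's items
        self.parent = parent    # dependency on the parent dataset (None for a source)
        self.name = name

    def map(self, fn):
        """Transformation: returns a new dataset, computes nothing."""
        return LazyDataset(lambda items: [fn(x) for x in items], self, "map")

    def filter(self, pred):
        """Transformation: returns a new dataset, computes nothing."""
        return LazyDataset(lambda items: [x for x in items if pred(x)], self, "filter")

    def lineage(self):
        """Walk the parent pointers and list the chain from source to self."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]

    def collect(self):
        """Action: compute the parent chain first, then this dataset."""
        parent_items = self.parent.collect() if self.parent is not None else None
        return self.compute(parent_items)


calls = []  # records when the source data is actually read


def load(_):
    calls.append("load")
    return [1, 2, 3, 4, 5, 6]


source = LazyDataset(load)
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
calls_before_action = list(calls)  # still empty: no work has been done yet
result = evens_squared.collect()   # the action triggers the whole chain
```

After the last line, `result` holds the squared even numbers and `calls` contains a single `"load"` entry, while `calls_before_action` is empty. Because every dataset knows its parent and its compute function, a lost result can be rebuilt by replaying the chain, which is the idea behind lineage-based fault tolerance.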
