Published on June 1, 2016
1. 1© Cloudera, Inc. All rights reserved. Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu Jeremy Beard | Senior Solutions Architect, Cloudera May 2016
2. 2© Cloudera, Inc. All rights reserved. Myself • Jeremy Beard • Senior Solutions Architect at Cloudera • 3.5 years at Cloudera • 6 years data warehousing before that • email@example.com
3. 3© Cloudera, Inc. All rights reserved. Agenda • What do we mean by near-real-time analytics? • Which components can we use from the Cloudera stack? • How do these components fit together? • How do we implement the Spark Streaming to Kudu path? • What if I don’t want to write all that code?
4. 4© Cloudera, Inc. All rights reserved. Defining near-real-time analytics (for this talk) • Ability to analyze events happening right now in the real world • And in the context of all the history that has gone before it • By “near” we mean this is human scale (seconds), not machine scale (ns/us) • Closer to real time is possible in CDH, but is more custom development • SQL is the lingua franca of analytics • Millions of people know it or the tools that run on it • Say what you want to get not how you want to get it
5. 5© Cloudera, Inc. All rights reserved. Components from the Cloudera stack • Four components come together to make this possible • Apache Kafka • Apache Spark • Apache Kudu (incubating) • Apache Impala (incubating) • First we’ll discuss what they are, then how they fit together
6. 6© Cloudera, Inc. All rights reserved. Apache Kafka • Publish-subscribe system • Publish messages into topics • Subscribe to messages arriving in topics • Very high throughput • Very low latency • Distributed for fault tolerance and scale • Supported by Cloudera
7. 7© Cloudera, Inc. All rights reserved. Apache Spark • Modern distributed data processing engine • Heavy utilizer of memory for speed • Rich and intuitive API • Spark Streaming • Module for running a continuous loop of Spark transformations • Each iteration is a micro-batch, usually in the single-digit seconds • Supported by Cloudera (with some exceptions for experimental features)
8. 8© Cloudera, Inc. All rights reserved. Apache Kudu (incubating) • New open-source columnar storage layer • Data model of tables with finite typed columns • Very fast random I/O • Very fast scans • Developed from scratch in C++ • Client APIs for C++, Java, Python • First developed in Cloudera, now at Apache Software Foundation • Currently in beta, not yet supported by Cloudera, not production ready
9. 9© Cloudera, Inc. All rights reserved. Apache Impala (incubating) • Open-source SQL query engine • Built for one purpose: really fast analytics SQL • High concurrency • Queries data stored in HDFS, HBase, and now Kudu • Standard JDBC/ODBC interface for SQL editors and BI tools • Uses JIT query compilation and modern CPU instructions • First developed in Cloudera, now at Apache Software Foundation • Fully supported by Cloudera and in production at many of our customers
10. 10© Cloudera, Inc. All rights reserved. Near-real-time analytics on the Cloudera stack
11. 11© Cloudera, Inc. All rights reserved. Implementing Spark Streaming to Kudu • We define what we want Spark to do each micro-batch • Spark then takes care of running the micro-batches for us • We have limited time to process a micro-batch • Storage lookups must be key lookups or very short scans • A lot of repetitive boilerplate code to get up and running
12. 12© Cloudera, Inc. All rights reserved. Typical stages of a Spark Streaming to Kudu pipeline • Sourcing from a queue of data • Translating into a structured format • Deriving the storage records • Planning how to update the storage layer • Applying the planned mutations to the storage layer
13. 13© Cloudera, Inc. All rights reserved. Queue sourcing • Each micro-batch we first have to bring in data to process • This is near-real-time, so we expect a queue of messages waiting to be processed • Kafka fits this requirement very well • Native no-data-loss integration with Spark Streaming • Partitioned topics automatically parallelize across Spark executors • Fault recovery simple because Kafka does not drop consumed messages • In Spark Streaming this is the creation of a DStream object • For Kafka use KafkaUtils.createDirectStream()
14. 14© Cloudera, Inc. All rights reserved. Translation • Arriving messages could be in any format (XML, CSV, binary, proprietary, etc.) • We need them in a common structured record format to effectively transform it • When messages arrive, translate them first • Avro’s GenericRecord is a widely adopted in-memory record format • In Spark Streaming job use DStream.map() to define translation
15. 15© Cloudera, Inc. All rights reserved. Derivation • We need to create the records that we want to write to the storage layer • Often not identical to the arriving records • Derive the storage records from the arriving records • Spark SQL can define transformation, but much more plumbing code required • May also require deriving from existing records in the storage layer • Enrichment using reference data is a common example
16. 16© Cloudera, Inc. All rights reserved. Planning • With derived storage records in hand we need to plan the storage mutations • When existing records are never updated it is straight-forward • Just plan inserts • When updates for a key can occur it is a bit harder • Plan insert if key does not exist, plan update if key does exist • When all versions of a key are kept it can be a lot more complicated • Insert arriving record, update metadata on existing records (e.g. end date)
17. 17© Cloudera, Inc. All rights reserved. Storing • With the planned mutations for the micro-batch, we apply them to the storage • For Kudu this requires using the Kudu client Java API • Applied mutations are immediately visible to Impala users • Use RDD.forEachPartition() so that you can open a Kudu connection per JVM • Alternatively write to Solr, can be a good option where SQL is not required • Alternatively write to HBase, but storage is too slow for analytics queries • Alternatively write to HDFS, but storage does not support updates or deletes
18. 18© Cloudera, Inc. All rights reserved. Performance considerations • Repartition the arriving records across all the cores of the Spark job • If using Spark SQL, lower the number of shuffle partitions from default 200 • Use Spark Streaming backpressure to optimize micro-batch size • If using Kafka, also use spark.streaming.kafka.maxRatePerPartition • Experiment with micro-batch lengths to balance latency vs. throughput • Ensure storage lookup predicates are at least by key, or face full table scans • Avoid connecting and disconnecting from storage every micro-batch • Singleton pattern can help to keep a connection per JVM • Avoid instantiating objects for each record where they could be reused • Batch mutations for higher throughput
19. 19© Cloudera, Inc. All rights reserved. New on Cloudera Labs: Envelope • A pre-developed Spark Streaming application that implements these stages • Pipelines are defined as simple configuration using a properties file • Custom implementations of stages can be referenced in the configuration • Available on Cloudera Labs (cloudera.com/labs) • Not supported by Cloudera, not production ready
20. 20© Cloudera, Inc. All rights reserved. Envelope built-in functionality • Queue source for Kafka • Translators for delimited text, key-value pairs, and binary Avro • Lookup of existing storage records • Deriver for Spark SQL transformations • Planners for appends, upserts, and history tracking • Storage system for Kudu • Support for many of the described performance considerations • All stage implementations are also pluggable with user-provided classes
21. 21© Cloudera, Inc. All rights reserved. Example pipeline: Traffic
22. 22© Cloudera, Inc. All rights reserved. Example pipeline: FIX
23. 23© Cloudera, Inc. All rights reserved. Thank you firstname.lastname@example.org
Spark Streaming allows developers to ... New York Hadoop User group. ... this group; Join us! Building effective near-real-time analytics with Spark ...
Building effective near-real-time analytics with Spark Streaming and Kudu
You are here. Home; Events; Building effective near-real-time analytics with Spark Streaming and Kudu; Building effective near-real-time analytics with ...
... Building effective near-real-time analytics with Spark Streaming & Kudu: ... Building effective near-real-time analytics with Spark Streaming & Kudu
... and interface of Google Analytics; ... Building effective near-real-time analytics with Spark Streaming & Kudu ... Building Solns for 21st Century Transit
NY Hadoop User group May 31 ... Building effective near-real-time analytics with Spark Streaming and Kudu. This event belongs to PRontheGO. ...
... Spark Streaming & Dealing with ... The Apache Software Foundation Announces ... Building effective near-real-time analytics with Spark ...
Cloudera provides the world’s fastest, easiest, and most secure Hadoop platform. Cloudera provides the world’s fastest, easiest, and most secure ...
Holiday Science for Toddlers & Preschoolers. Fri, Dec 18 2015 at 10am - 11am Small Talk Family Cafe 1536 Newell Ave Walnut Creek, CA. Add to Calendar ...