Published on February 25, 2014
Agenda
- What do we mean by synergy?
- Storm
- Shark / Spark
- Redis
- ElasticSearch
- Hadoop
What do we mean by Synergy?
synergy
1. The interaction of two or more agents or forces so that their combined effect is greater than the sum of their individual effects.
What do we mean by Synergy?
Cassandra excellent for:
- Fast read or write performance
- Scalable, runs on commodity hardware
- Reliable cross-DC replication
- Robust persistence for high volume data
Needs some special sauce for:
- Real-time calculations for high volume streams
- Complex search functions (free-text etc.)
- Map Reduce on RDDs
Storm
- Open sourced by Twitter in 2011
- Distributed event processor
- Operates on unbounded streams of tuples
- Getting started in Apache Incubator
- Can persist to and read from C*
- Great for high volume, real time (complex) calculations on streamed data
Storm
- Is a CEP architecture
- Spout – collects & submits tuples for processing
- Bolt – processes tuples and emits new tuples
- Tuple – a collection of data passed in Storm
- Stream – identifies outputs from a spout / bolt and enforces tuple structure
- Uses ZooKeeper and ZeroMQ for coordination and message passing respectively
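The spout, bolt and tuple roles described above can be modelled in plain Java with no Storm dependency. This is only a rough sketch: `MiniTopology`, `Spout`, `Bolt` and the helper methods are hypothetical names, not Storm's API, and everything runs in one thread, whereas a real topology is distributed and processes tuples continuously.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for Storm's concepts; none of these types are Storm's API.
final class MiniTopology {

    /** A tuple is modelled as an ordered list of values. */
    interface Spout {
        /** Returns the next tuple, or null when the source is exhausted. */
        List<Object> nextTuple();
    }

    interface Bolt {
        /** Processes one input tuple and emits zero or more new tuples. */
        void execute(List<Object> input, Consumer<List<Object>> emitter);
    }

    /**
     * Drains the spout, then passes the tuples through each bolt in order.
     * Storm itself never "drains" a spout; this is a simplification.
     */
    static List<List<Object>> run(Spout spout, List<Bolt> bolts) {
        List<List<Object>> current = new ArrayList<>();
        for (List<Object> t = spout.nextTuple(); t != null; t = spout.nextTuple()) {
            current.add(t);
        }
        for (Bolt bolt : bolts) {
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> tuple : current) {
                bolt.execute(tuple, next::add);
            }
            current = next;
        }
        return current;
    }

    /** Spout that emits one single-field tuple per message. */
    static Spout messageSpout(List<String> messages) {
        Iterator<String> it = messages.iterator();
        return () -> it.hasNext() ? Arrays.<Object>asList(it.next()) : null;
    }

    /** Bolt that splits a message tuple into one tuple per lower-cased word. */
    static Bolt splitWordsBolt() {
        return (input, emitter) -> {
            String message = (String) input.get(0);
            for (String word : message.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.accept(Arrays.<Object>asList(word));
                }
            }
        };
    }
}
```

Chaining `messageSpout` into `splitWordsBolt` through `run` gives one output tuple per word, which is the shape of input a downstream counting bolt would expect.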
Synergy?
- Can use Cassandra as the input data source
- Can write tuples into Cassandra
- Example project here… https://github.com/tjake/stormscraper/
- See CassandraWriterBolt.java for a simple example of a Java Driver CQL based bolt that writes to Cassandra
- Good as an example application, but not production ready
Use Case
Top N words for popularity tracking
- Input: a constant stream of messages into the system
- Count occurrences of each word in a message
- Store raw messages in Cassandra
- Use a bolt to break up messages and maintain a sorted list of top N words
- Persist the Top N words and their counts periodically in Cassandra
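The counting and ranking steps of this use case can be sketched in a single process as follows. This is a hedged illustration rather than the storm-starter implementation: `TopWords` is a hypothetical class, it keeps every count in one in-memory map, and it has no rolling window or periodic persistence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process word counter, not the storm-starter code.
final class TopWords {
    private final Map<String, Long> counts = new HashMap<>();

    /** Splits a message on non-letter characters and counts each lower-cased word. */
    void addMessage(String message) {
        for (String word : message.toLowerCase().split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    /** Returns the n most frequent words, highest count first; ties break alphabetically. */
    List<Map.Entry<String, Long>> topN(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        Comparator<Map.Entry<String, Long>> byCountDesc =
                (a, b) -> Long.compare(b.getValue(), a.getValue());
        Comparator<Map.Entry<String, Long>> byWord =
                (a, b) -> a.getKey().compareTo(b.getKey());
        entries.sort(byCountDesc.thenComparing(byWord));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

In a topology, `addMessage` would correspond to the work done in the splitting and counting bolts, and the result of `topN` is what would be written out periodically.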
Use Case
CREATE TABLE messages (
  date_hour TIMESTAMP,
  message_id TIMEUUID,
  message VARCHAR,
  PRIMARY KEY (date_hour, message_id)
);
CREATE TABLE top_words (
  date_hour TIMESTAMP,
  position INT,
  word VARCHAR,
  PRIMARY KEY (date_hour, position)
);
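Both tables are partitioned by date_hour. Assuming that column holds the start of the hour in which a message arrived (the slides do not say so explicitly), the partition key could be derived from an event timestamp as in this small sketch. `HourBucket` and `dateHour` are hypothetical names.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative helper; the class and method names are hypothetical.
final class HourBucket {
    /** Truncates a timestamp to the start of its UTC hour, for use as date_hour. */
    static Instant dateHour(Instant eventTime) {
        return eventTime.truncatedTo(ChronoUnit.HOURS);
    }
}
```

Under that assumption, all messages from the same hour land in one partition, ordered by message_id, and the top N words for that hour occupy positions 1 to N in a single top_words partition.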
Use Case
- https://github.com/nathanmarz/storm-starter/
- Use RollingTopWords.java as base
- Integrate CassandraWriterBolt into use case
- Add spout for input messages
- Add bolt for persisting messages & writing Top N words
- Reference: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
Storm: Conclusion
- Powerful architecture
- Lots of potential as an Apache project
- Nice abstractions to simplify development (Trident)
- Great for operating on high velocity, high volume streams
- Not prohibitively difficult to integrate with other systems for input and output
- Lots of people experimenting with it!
Spark & Shark Lightning fast cluster computing
Apache Spark
- Up to 100x faster than Hadoop MapReduce for in-memory workloads
- Faster in-memory MapReduce operations
- Integration with Cassandra either via:
  - https://github.com/tuplejump/calliope-release
  - Or via Cassandra's Hadoop support
- Combines SQL, Streaming and Complex Analytics
Apache Spark
Can read and write to Cassandra…
Reading from CF / Table into RDD via Calliope (Scala):
val cas = CasBuilder.cql3.withColumnFamily("casDemo", "Words").where("book = 'The Three Musketeers'")
val rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cas)
* where clause can use partition key or secondary index; CasBuilder also supports paging
Shark
- With Spark we can achieve super fast in-memory queries on subsets of data in Cassandra
- Effectively all the features of Hive running on RDDs, not HDFS
- Uses HiveQL queries
- Includes machine learning algorithms out of the box
- CqlStorageHandler provided to read RDD from Cassandra or read SSTables directly
- https://github.com/richardalow/cassowary
Spark / Shark: Conclusion
- Need resource isolation if running directly on Cassandra nodes
- Otherwise dealing with higher latency but not affecting cluster resources
- Impressive possibilities for machine learning algorithms as well as more basic Hive queries
- Introduces possibilities for JOINs on hot data!
What is it?
“Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.”
Synergy?
- Good for…
  - Sorting sets & lists
  - Pubsub messaging
  - (more) Accurate counters
  - Merging sets
  - Transactions!
- Works in memory, can serve data fast based on key
- Good for runtime storage of aggregate data
- Could use shared resources on Cassandra nodes (could populate most recent data via triggers (naughty))
Elastic Search
Distributed real-time search engine based on Apache Lucene
What is it?
- Distributed real-time search engine
- Built from the ground up for reliability and scalability
- Supports lots of other features as well:
  - Free text search
  - Spatial
  - Query by arbitrary fields
  - Facets
  - Multi-lingual query support
Synergy?
- Although external to Cassandra, it can provide rich query capabilities over the same data
- Simplify data models in Cassandra to maximise storage
- Separate read and write workloads (read from ES, write to Cassandra)
- Some integration for Storm for writing records to Elastic Search and Cassandra as data enters the system
- Again… Spatial!
Hadoop Batch Analytics
What is it?
- Open Source under Apache License 2.0
- Top Level Apache project
- Runs on commodity hardware
- Used for storage and large scale processing of data-sets
- Lots of complementary tools… Impala, Mahout etc.
Some terms…
- HDFS - a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop MapReduce - a programming model for large scale data processing.
- Hive - an SQL-like abstraction for MapReduce jobs
- Pig - a procedural style language for expressing MapReduce jobs
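To make the MapReduce programming model concrete, here is a minimal single-JVM word count that mirrors its phases. It is a sketch of the model only and does not use Hadoop's API; `WordCountModel` is a hypothetical name, and a real job would distribute the map and reduce phases across the cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: models map, shuffle and reduce in one JVM, without Hadoop.
final class WordCountModel {
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // Map phase: emit one word per occurrence (an implicit (word, 1) pair).
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // Shuffle and reduce: group identical words and sum their ones.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new,
                        Collectors.summingInt(word -> 1)));
    }
}
```

Hive and Pig generate jobs of this general shape from higher-level queries and scripts, so the same grouping and summing would be written there as a query rather than as code.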
Synergy?
- Multiple ways to use it with Cassandra
- DataStax Enterprise supports Hadoop on top of a Cassandra File System
- Replication managed in-cluster (efficient)
- Full Hadoop toolset available
- Some Hadoop support in vanilla distribution
- Limited support for efficient querying