Published on November 12, 2013
Analytics on top of Cassandra and Hadoop Dmitry Mezhensky | Mirantis Inc #CASSANDRAEU
What we will discuss today
● Analytics on Cassandra using Hadoop
● Various types of statistics & implementation
● Scalability of approach
Problems
● Too many statistics (more than 100)
● Various types
  ○ Top N
  ○ Time series
  ○ Min/max/average/median
  ○ Extremum values on time interval
  ○ Fraud analysis
● Huge amount of data
● Scalability of approach
Statistics implementation on Hadoop
Top N
● Map phase generates <Key, Value> pairs; the top N is built by Value
● Reduce phase accumulates values; results are persisted to Cassandra via a custom output format
● A suitable comparator was used in Cassandra for the top N entities
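The map and reduce steps for Top N can be sketched in plain Python. This is a minimal in-memory illustration, not the speaker's Hadoop job: the record fields `page` and `hits` are hypothetical, and summing is only one possible way to "accumulate" values.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for record in records:
        # "page" and "hits" are hypothetical field names used for illustration.
        yield record["page"], record["hits"]

def reduce_phase(pairs):
    """Accumulate (here: sum) all values emitted for the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

records = [
    {"page": "/home", "hits": 3},
    {"page": "/about", "hits": 1},
    {"page": "/home", "hits": 2},
    {"page": "/pricing", "hits": 4},
]
totals = reduce_phase(map_phase(records))
```

In a real job the reducer output would go to Cassandra through the custom output format the slide mentions; here it simply stays in a dictionary.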
Top N
● On the write stage to Cassandra, sorting is done by value
● On the read stage, the first N records are the top N values
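The "sort on write, read the first N" idea can be imitated without Cassandra. The sketch below assumes a descending-by-value ordering and uses a hypothetical `SortedRow` class as a stand-in for a row whose columns are kept ordered by a comparator; it does not use any Cassandra API.

```python
import bisect

class SortedRow:
    """In-memory stand-in for one wide row whose columns stay sorted on write."""

    def __init__(self):
        # Entries are (-value, key) tuples kept in ascending order.
        self._columns = []

    def write(self, key, value):
        # Negating the value makes ascending tuple order equal to
        # descending order by value, which plays the role of the comparator.
        bisect.insort(self._columns, (-value, key))

    def read_first(self, n):
        """Return the first n columns; no sorting is needed at read time."""
        return [(key, -negated) for negated, key in self._columns[:n]]

row = SortedRow()
for key, value in {"/home": 5, "/about": 1, "/pricing": 4}.items():
    row.write(key, value)

top2 = row.read_first(2)
```

The design choice this illustrates is that the ordering cost is paid once per write, so a Top N read becomes a simple slice of the first N entries.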
Time series
● Map phase generates <Time, Value> pairs
● Reduce phase accumulates values (behaviour varies by statistic)
● Results are persisted to Cassandra via a custom output format, using one row key per statistic and one column per date
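A rough sketch of the time-series layout follows, assuming daily granularity and a nested dictionary in place of Cassandra (outer key = row key per statistic, inner key = column per date). The statistic names `daily_total` and `daily_peak` and the helper names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

def map_to_days(events):
    """Emit (date, value) pairs, truncating each timestamp to its day."""
    for timestamp, value in events:
        yield datetime.fromisoformat(timestamp).date().isoformat(), value

def reduce_by_day(pairs, accumulate=sum):
    """Group values by date, then apply the statistic-specific accumulator."""
    grouped = defaultdict(list)
    for day, value in pairs:
        grouped[day].append(value)
    return {day: accumulate(values) for day, values in grouped.items()}

def persist(store, statistic, per_day):
    """Write one row per statistic, with one column per date."""
    store.setdefault(statistic, {}).update(per_day)

events = [
    ("2013-10-17T09:00:00", 2),
    ("2013-10-17T18:30:00", 3),
    ("2013-10-18T11:15:00", 7),
]
store = {}
persist(store, "daily_total", reduce_by_day(map_to_days(events)))
persist(store, "daily_peak", reduce_by_day(map_to_days(events), accumulate=max))
```

Passing a different `accumulate` function is one way to model the "behaviour varies by statistic" point on the slide.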
Maximum, minimum, extremum on interval
● Max/min values are simple to calculate
● Extremum on interval is calculated similarly to time series
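One possible reading of "similarly to time series" is that the extremum is taken over the per-date columns that fall inside the requested interval. The sketch below works under that assumption; `extremum_on_interval` is a hypothetical helper and the data is invented.

```python
def extremum_on_interval(columns, start, end, pick=max):
    """Return the (date, value) column with the extreme value in [start, end].

    Dates are ISO strings, so plain string comparison orders them correctly.
    """
    in_range = [(day, value) for day, value in columns.items() if start <= day <= end]
    if not in_range:
        return None
    return pick(in_range, key=lambda column: column[1])

daily = {"2013-10-16": 4, "2013-10-17": 9, "2013-10-18": 2, "2013-10-19": 11}

highest = extremum_on_interval(daily, "2013-10-16", "2013-10-18")
lowest = extremum_on_interval(daily, "2013-10-16", "2013-10-18", pick=min)
```

Note that the overall maximum (on 2013-10-19) is excluded here because it lies outside the interval.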
Fraud analysis
● Fraud analysis runs after all statistics are calculated
● Processed data is filtered by fraud filters
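The slides do not say what the fraud filters look like. As a sketch only, a filter can be modelled as a predicate over an already computed entry; the two rules below (`too_many_hits`, `missing_user_agent`) and all field names are invented for illustration and are not taken from the talk.

```python
def too_many_hits(entry, limit=1000):
    """Hypothetical rule: an implausibly high hit count is suspicious."""
    return entry["hits"] > limit

def missing_user_agent(entry):
    """Hypothetical rule: an empty or absent user agent is suspicious."""
    return not entry.get("user_agent")

FRAUD_FILTERS = [too_many_hits, missing_user_agent]

def apply_fraud_filters(entries, filters=FRAUD_FILTERS):
    """Split processed entries into clean and flagged lists."""
    clean, flagged = [], []
    for entry in entries:
        is_fraud = any(check(entry) for check in filters)
        (flagged if is_fraud else clean).append(entry)
    return clean, flagged

entries = [
    {"source": "a", "hits": 12, "user_agent": "Mozilla/5.0"},
    {"source": "b", "hits": 5000, "user_agent": "Mozilla/5.0"},
    {"source": "c", "hits": 7, "user_agent": ""},
]
clean, flagged = apply_fraud_filters(entries)
```

Running the filters as a separate pass over processed data matches the ordering on the slide: statistics first, fraud analysis afterwards.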
Scalability approach
● Data is read from and written to Cassandra only
● Hadoop is elastically scalable
● Cassandra is elastically scalable
● No bottleneck
Thank you!