Published on February 4, 2014
nearly three years of continuous changes of approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014
We better get it right first time!
Starting point ● we had two developers ● we had one live server ● we had one cold backup ● we can’t store all the data ● we can’t process all the data
Batch processing - the downsides ● batch every 3 hours ○ delete old data ● updating counters ○ you need to define them upfront ● throwing away old data ○ developer point of view ■ you have no way to correct your mistake ○ business ■ you lose your data
Batch processing - the benefits ● you will learn ○ profiler is your best friend ○ optimizing can be hard and can take time ● what are good access logs good for ○ reconstruct your deleted data
Business says: save all data
Big Data ● It’s not only about the volume ● What we gonna do with it? ○ We had NO idea! ● We rent more servers. ○ We needed place where to store the data
Big Data ● We went the NoSQL way ○ MongoDB ■ easy replication, possible sharding ■ upsert ■ rich document based queries - we still were one foot in the SQL world ■ fast prototype ● We were still doing batch processing ● ~15m impressions per day ending with ~5GB raw data per day
Big Data ● each day as collection ○ easy for batch processing ● each impression as a document ● adding processed parameters over time ● pulling data from 30 collections ○ server is not responding ○ virtual memory is low
Big Data - analytics ● Visitors counts on website/section ○ active - with subscription ○ inactive - without subscription ○ anonymous ● Content consumption ○ how many pageviews ■ active ■ inactive ■ anonymouse ● and others
Business asks: how many UNIQUE users did … in month
What we really need ● COUNT(* || DISTINCT ...) GROUP BY ○ entities ○ date periods (day, week, month) ○ combination of entities and date periods [and some other flags] ● Special demands from analytics team ○ Not too hard to implement with SQL magic ● As fast as possible ○ Minimally as fast as data are incoming ● Still store all historical raw data ○ Ideally compressed
What to do ● Processing raw data? ○ Use lot of space, before getting result ■ We need to store historical data anyway ■ You can store compressed files (LZO) in Hadoop ● Sharding ○ For how long? ○ How to properly determine sharding key(s)? ● Do you have really big amount of data? ● Do you have hardware for running Hadoop? Really? ● What overnight batch processing really means?
Naive solution ● Separate counter for each needed combination, updated for each impression, maybe with touching DB ○ Fast to generate unique key for combination ■ md5([entityType, entityId, day, dayId].join("|")) ○ Really fast to get value ■ Always primary key ■ Multiget ○ Need to define all GROUP BY combinations on beginning ○ Failure during processing one impression ■ Need to increment counters in transaction
Real world solution ● Kafka ○ Buffering incoming data ○ Web workers as producers ● Storm / Trident ○ Consuming data from Kafka ○ Processing incoming data ○ Using cassandra as storage backend ● Cassandra ○ Holding counters and helper informations to determine uniquity
Storm ● Real time processing of unbounded streams of data ○ Processing data as they come ○ You still need to have computing power ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything to steps of updates of counters ○ Java, but bolts can be written in different languages
Storm ● Spouts ● Bolts
Trident ● High level abstraction over Storm ○ ○ ○ ○ ○ Joins Aggregations Grouping Filtering Functions
Trident ● Operating in transactions ● Persistent aggregation ○ “Memcached” ○ Cassandra ● DRPC calls ○ No need to touch Cassandra ● Local cluster for development ● Easy to learn basics ● Hard to discover advanced stuff ■ Lack of documentation ■ Need to tune configuration
Trident ● Functions ○ You can do everything you want ■ Touch DB, read emails, … ○ Stay with java ■ No dependencies problem ■ No performance penalty ● Topology ○ Good to define on beginning ■ Spend time on detailed diagram ■ Save you during implementation and future updates ○ Don’t do it too much complex ■ Problem with loading it
Cassandra ● Already in our production on different project ● No SPOF ● Multi Master ● Scalable ● More good stuff ● Lot of new features in 2.x ○ Lite transactions ○ Lot of fixes ■ Good old times on 0.8 ■ Our bug report from 2011 - Double load of commit log on node start :)
Kafka ● A high-throughput distributed messaging system ● Something like distributed commit log ○ You can set retention ○ You can move reading offset back ■ Used by Trident transactions ● Cluster ● Ideally to use with Trident
Business asks: are you ready for ~250m impressions per day?
Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...
In this presentation we will describe our experience developing with a highly dyna...
Presentation to the LITA Forum 7th November 2014 Albuquerque, NM
Un recorrido por los cambios que nos generará el wearabletech en el futuro
Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...
Data gathering and processing; ... to apply the talent and passion you’ve demonstrated in traditional media to a revolutionary new approach to content ...
The choice of method is influenced by the data collection strategy, the type of variable, the accuracy required, the collection point and the skill of the ...
... natural language processing and ... truly agnostic approach to gathering diverse data sources and ... Ripjar’s Strategic ...
Data collection is the process of gathering and measuring ... quality control’ as two approaches that can preserve data integrity and ensure ...
... results team with particular skill to gathering data, analyzing, processing ... to approach new ... Media. Kontakt; Impressum;
Learning Space Toolkit. Main menu. Roadmap; Needs Assessment; Space Types; Services; Technology; Integration; ... Overview of gathering data from trends ...
Benefits of databases The gathering, ... a disciplined approach to data management ... system is unlikely to offer sophisticated data processing or
the information gathered four years ago for the ﬁ rst Anti - Cartel Enforcement Manual. ... approaches to digital evidence gathering ... media (data ...
We approach social crises as ... collective social reporting as an information processing ... Community Intelligence and Social Media Services: A Rumor ...