Published on February 27, 2014
MyHeritage and Kafka Author: Ran Levy Feb 2014
Agenda • MyHeritage use cases • Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
Use cases • Two major use cases: – Indexing to SuperSearch and Record Matching. – Stats reporting to BI.
Use case 1 • Indexing to SuperSearch and Record Matching
Use case 1 – cont'd • The existing solution was custom and non-scalable; it involved processing changes and updating SuperSearch (SOLR over Lucene). • Required solution should support: – Continuous mode. – High throughput. – Scaling up. – Repeating the process from some point. – Guaranteed order of processed items. – Reliability. – Multiple consumers.
Use case 2 • Statistics reporting to BI system
Use case 2 – cont'd • Required solution should support: – High scale (~500GB of data / day). – Scale up – a few hundred million per day. – Repeating the process from some point. – Multiple consumers.
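To put the scale figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes 1 GB = 10^9 bytes, reads "a few hundred million per day" as 300 million events per day (an illustrative figure, not one from the slides), and assumes load is spread evenly over the day, which real traffic is not.

```python
# Back-of-the-envelope sizing for the BI reporting use case.
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = 500 * 10**9      # ~500 GB/day, from the slide (1 GB = 10^9 bytes assumed)
events_per_day = 300 * 10**6     # ASSUMPTION: "a few hundred million" taken as 300 million

avg_bytes_per_second = bytes_per_day / SECONDS_PER_DAY
avg_events_per_second = events_per_day / SECONDS_PER_DAY
avg_event_size = bytes_per_day / events_per_day

print(f"average throughput: {avg_bytes_per_second / 1e6:.2f} MB/s")
print(f"average event rate: {avg_events_per_second:.0f} events/s")
print(f"average event size: {avg_event_size:.0f} bytes")
```

Under these assumptions the average load is about 5.8 MB/s; peak load would be higher, so any real sizing should leave headroom.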
Agenda MyHeritage use cases • Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
Possible Solutions • What we considered: – DB – Queues
Possible Solutions • Key points about queues – Messages are deleted once they are consumed. – Messages are duplicated to support multiple readers.
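The two queue properties above can be illustrated with a small in-memory sketch. `FanOutQueues` and the reader names are hypothetical and do not model any particular queueing product; the point is only that destructive reads force one stored copy per reader and make replay impossible.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional


class FanOutQueues:
    """Toy model of a classic queue setup: one private queue per reader."""

    def __init__(self, readers: Iterable[str]) -> None:
        self.queues: Dict[str, Deque[str]] = {name: deque() for name in readers}

    def publish(self, message: str) -> None:
        # The message is duplicated: every reader gets its own copy.
        for q in self.queues.values():
            q.append(message)

    def consume(self, reader: str) -> Optional[str]:
        # Consuming removes the message, so it cannot be read again later.
        q = self.queues[reader]
        return q.popleft() if q else None

    def stored_copies(self) -> int:
        return sum(len(q) for q in self.queues.values())


if __name__ == "__main__":
    queues = FanOutQueues(["indexing", "record_matching"])
    queues.publish("tree change 1")
    print(queues.stored_copies())        # two copies of one message
    print(queues.consume("indexing"))    # first read returns the message
    print(queues.consume("indexing"))    # second read finds nothing to replay
```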
Agenda MyHeritage use cases Possible solutions • Kafka overview • Actual implementation @MyHeritage • Summary
Kafka Overview • A high-throughput distributed messaging system – Fast – Scalable – Durable – Distributed by design – Simplicity (over functionality)
Kafka Overview • Fast (very fast) – both for producer and consumer Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
Kafka Overview • Main entities – Producer – pushes data. – Consumer – pulls data. – Brokers – load balance producers by partition. – Topic – a feed of messages that belong to the same logical category.
Kafka Overview – some internals • Communication between the clients and the servers is done with a simple, high-performance TCP protocol. • For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.
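A partitioned, append-only log can be sketched in a few lines. This is a toy in-memory model, not Kafka's implementation: `PartitionedLog` is a hypothetical name, and choosing the partition by hashing a message key is an assumption made here to show how per-key ordering can be preserved. The 32 partitions mirror the "part 1 … part 32" labels in the deck's overview diagram.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy stand-in for one topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # ASSUMPTION: a stable hash of the key picks the partition, so all
        # messages for one key land in the same partition, in order.
        return zlib.crc32(key) % len(self.partitions)

    def append(self, key: bytes, value: bytes) -> Tuple[int, int]:
        # Messages are only ever appended; the offset is the list position.
        partition = self.partition_for(key)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int, max_messages: int = 100) -> List[bytes]:
        # Reading does not remove anything from the log.
        return self.partitions[partition][offset:offset + max_messages]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=32)
    p1, o1 = log.append(b"tree-42", b"added person")
    p2, o2 = log.append(b"tree-42", b"edited person")
    print(p1 == p2, o1, o2)      # same partition, consecutive offsets
    print(log.read(p1, o1))      # both messages, in append order
```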
Kafka Overview – some internals • Messages stay on disk after they are consumed and are deleted only after a defined TTL. • The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. • Each partition is replicated across a configurable number of servers for fault tolerance.
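The retention behaviour above is what makes multiple consumers and "repeating the process from some point" cheap: each consumer only tracks an offset. A minimal sketch follows, assuming a single partition and caller-supplied timestamps; `RetainedLog` and its methods are hypothetical names, not Kafka APIs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RetainedLog:
    """Toy single-partition log with time-based retention."""

    ttl_seconds: float
    base_offset: int = 0  # offset of the oldest message still retained
    entries: List[Tuple[float, str]] = field(default_factory=list)  # (timestamp, message)

    def append(self, message: str, now: float) -> int:
        self.entries.append((now, message))
        return self.base_offset + len(self.entries) - 1

    def read_from(self, offset: int) -> List[str]:
        # Reads never delete; any consumer may start from any retained offset.
        start = max(offset, self.base_offset) - self.base_offset
        return [message for _, message in self.entries[start:]]

    def expire(self, now: float) -> None:
        # Only age removes messages, regardless of who has read them.
        while self.entries and now - self.entries[0][0] > self.ttl_seconds:
            self.entries.pop(0)
            self.base_offset += 1


if __name__ == "__main__":
    log = RetainedLog(ttl_seconds=60)
    log.append("a", now=0)
    log.append("b", now=10)
    log.append("c", now=50)
    print(log.read_from(0))   # one consumer reads everything
    print(log.read_from(1))   # another consumer resumes from its own offset
    print(log.read_from(0))   # replay: the first read deleted nothing
    log.expire(now=65)
    print(log.read_from(0))   # "a" is older than the TTL and is gone
```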
Agenda MyHeritage use cases Possible solutions Kafka overview • Actual implementation @MyHeritage • Summary
High Level Overview [Diagram: Producers and Consumers around two brokers. Labels shown: Web, Daemons, Face recog., Logstash reader, Indexing, RecordMatching. Broker 1 and Broker 2 each show a Family Tree changes Topic and an Activity Topic, with partitions part 1, part 2, … part 32. Also shown: a DRBD replica of Broker 1 and a DRBD replica of Broker 2.]
Kafka @Myheritage - producers [Diagram: App Modules dispatch events to the Events System, which notifies Subscribers. Subscribers shown: EventLogger, Activity Manager. Also labeled: ILogWrite.]
Kafka @Myheritage - producers [Diagram: KafkaWriter and its related components: Topic, BrokersConfig, IStats, ISelector, ILogger, ISerializer.]
Kafka @Myheritage - producers [Diagram: App Modules dispatch events to the Events System, which notifies Subscribers. The EventLogger subscriber uses KafkaWriter, which attempts the 1st broker and, if that fails, attempts the 2nd broker.]
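The "attempt 1st broker, if failed attempt 2nd broker" behaviour on this slide can be sketched as a simple ordered failover loop. Everything here is illustrative: `write_with_failover`, `BrokerUnavailable`, and the callables standing in for brokers are hypothetical, and the real KafkaWriter may handle retries, timeouts, and broker selection differently.

```python
from typing import Callable, List, Optional, Sequence


class BrokerUnavailable(Exception):
    """Raised by a (hypothetical) broker send function when the broker is down."""


def write_with_failover(message: bytes, brokers: Sequence[Callable[[bytes], None]]) -> int:
    """Try each broker in order; return the index of the one that accepted."""
    last_error: Optional[BrokerUnavailable] = None
    for index, send in enumerate(brokers):
        try:
            send(message)
            return index
        except BrokerUnavailable as error:
            last_error = error  # remember the failure and try the next broker
    raise BrokerUnavailable("all brokers failed") from last_error


if __name__ == "__main__":
    received: List[bytes] = []

    def broker_down(message: bytes) -> None:
        raise BrokerUnavailable("broker 1 is down")

    def broker_up(message: bytes) -> None:
        received.append(message)

    used = write_with_failover(b"activity event", [broker_down, broker_up])
    print(used, received)
```

One consequence of this design is that a message written during a failover lands on a different broker than usual, so consumers need to read from both brokers to see every message.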
Kafka @Myheritage – Consumers (Indexing) [Diagram: per consumer type, one reader per partition. EventProcessors read from Broker 1 and Broker 2, get/update their watermark via KafkaWatermark, and add events to the IndexingQueue. IndexingWorkers fetch work from the queue and update items in SOLR.]
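The consumer flow on this slide (reader per partition, watermark, queue, workers) can be sketched with the standard library. This is a simplified illustration, not the MyHeritage code: `Watermark`, `process_partition`, and `indexing_worker` are hypothetical names, a dict stands in for SOLR, and advancing the watermark right after enqueueing is an assumption. The sketch also does nothing to order updates to the same item across workers, so the demo uses distinct item ids.

```python
import queue
import threading
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (item id, new value)


class Watermark:
    """Hypothetical stand-in for KafkaWatermark: next offset to read per partition."""

    def __init__(self) -> None:
        self._offsets: Dict[int, int] = {}

    def get(self, partition: int) -> int:
        return self._offsets.get(partition, 0)

    def update(self, partition: int, offset: int) -> None:
        self._offsets[partition] = offset


def process_partition(partition: int, log: List[Event], watermark: Watermark,
                      work_queue: "queue.Queue") -> None:
    """EventProcessor role: resume at the watermark, enqueue events, advance it."""
    for offset in range(watermark.get(partition), len(log)):
        work_queue.put(log[offset])
        watermark.update(partition, offset + 1)  # ASSUMPTION: advance after enqueue


def indexing_worker(work_queue: "queue.Queue", index: Dict[str, str],
                    lock: threading.Lock) -> None:
    """IndexingWorker role: fetch work and update the item in the index."""
    while True:
        event = work_queue.get()
        if event is None:  # sentinel: no more work
            break
        item_id, value = event
        with lock:
            index[item_id] = value  # stands in for "Update item" in SOLR


def run_demo() -> Tuple[Dict[str, str], Watermark]:
    partitions = {0: [("person-1", "Alice"), ("person-2", "Bob")],
                  1: [("person-3", "Carol")]}
    watermark, work_queue = Watermark(), queue.Queue()
    index: Dict[str, str] = {}
    lock = threading.Lock()
    workers = [threading.Thread(target=indexing_worker, args=(work_queue, index, lock))
               for _ in range(3)]
    for worker in workers:
        worker.start()
    for partition, log in partitions.items():
        process_partition(partition, log, watermark, work_queue)
    for _ in workers:
        work_queue.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()
    return index, watermark


if __name__ == "__main__":
    final_index, final_watermark = run_demo()
    print(sorted(final_index.items()), final_watermark.get(0), final_watermark.get(1))
```

Because the watermark is stored separately from the log, restarting a reader resumes where it left off, and resetting the watermark to an earlier offset repeats the process from that point.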
Agenda MyHeritage use cases Possible solutions Kafka overview Actual implementation @MyHeritage • Summary
Summary Kafka is a very fast and scalable system that is used extensively at MyHeritage; it is worth considering for the high-scale systems you work on.
Thank you and questions email@example.com