5. Use case 1 – cont.
• Custom, non-scalable solution that involved processing changes and updating SuperSearch (SOLR over Lucene).
• Required solution should support:
– Continuous mode.
– High throughput.
– Scaling up.
– Repeating the process from some point.
– Guaranteed order of processed items.
– Reliability.
– Multiple consumers.
7. Use case 2 – cont.
• Required solution should support:
– High scale (~500 GB of data / day).
– Scale up: a few hundred million per day.
– Repeating the process from some point.
– Multiple consumers.
8. Agenda
• MyHeritage use cases
• Possible solutions
• Kafka overview
• Actual implementation @MyHeritage
• Summary
10. Possible Solutions
• Key points about queues:
– Messages are deleted once they are consumed.
– Messages are duplicated to support multiple readers.
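The two queue properties above can be illustrated with a toy model. This is only a sketch: `FanOutQueue` and the reader names are hypothetical and do not correspond to any specific queueing product.

```python
from collections import deque


class FanOutQueue:
    """Toy queue broker: one private queue per registered reader."""

    def __init__(self, readers):
        self.queues = {name: deque() for name in readers}

    def publish(self, message):
        # Each reader gets its own copy, so storage grows with the reader count.
        for q in self.queues.values():
            q.append(message)

    def consume(self, reader):
        # Consuming removes the message; this reader cannot read it again.
        q = self.queues[reader]
        return q.popleft() if q else None

    def stored_copies(self):
        return sum(len(q) for q in self.queues.values())


broker = FanOutQueue(["indexing", "record_matching"])
broker.publish("person 42 updated")
copies_after_publish = broker.stored_copies()  # one copy per reader
first_read = broker.consume("indexing")
second_read = broker.consume("indexing")       # already deleted for this reader
```

In this model, "repeating the process from some point" is not possible once a message has been consumed, and every additional reader multiplies the stored data.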
12. Kafka Overview
• A high-throughput distributed messaging system
– Fast
– Scalable
– Durable
– Distributed by design
– Simplicity (over functionality)
13. Kafka Overview
• Fast (very fast) – both for producer and consumer
Reference: http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
14. Kafka Overview
• Main entities
– Producer – pushes data.
– Consumer – pulls data.
– Brokers – load balance producers by partition.
– Topic – a feed of messages that belong to the same logical category.
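As a rough illustration of spreading producers over partitions, the sketch below maps a message key to a partition. The function name, the key format, and the use of CRC32 are assumptions made for the example; it is not claimed to match the partitioner of any Kafka client.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    # CRC32 is used only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys: the same key always maps to the same partition.
keys = ["tree-1001", "tree-1002", "tree-1001"]
placements = [(key, choose_partition(key, 4)) for key in keys]
```

Because a given key always lands in the same partition, all messages for that key end up in one log, in the order they were produced.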
15. Kafka Overview – some internals
• Communication between the clients and the servers is done with a simple, high-performance TCP protocol.
• For each topic, the Kafka cluster maintains a partitioned log, which is an append-only commit log.
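A minimal in-memory sketch of one partition of such a log follows. `PartitionLog` and its methods are hypothetical names; a real broker persists the log to disk, which is omitted here.

```python
class PartitionLog:
    """Append-only log for one partition; an offset is a list index."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset: int, max_records: int = 10):
        # Reading never removes anything; the caller tracks its own offset.
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["add person", "edit person", "delete photo"]:
    log.append(event)

# Two independent consumers read the same log without affecting each other.
indexing_batch = log.read(offset=0)
matching_batch = log.read(offset=0, max_records=2)

# Replaying from some point is just reading again from an earlier offset.
replay = log.read(offset=1)
```

Since consumers only hold an offset, adding another consumer does not duplicate any data, and the order within the partition is the append order.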
16. Kafka Overview – some internals
• Messages stay on disk after they are consumed and are deleted once a defined TTL expires.
• The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions.
• Each partition is replicated across a configurable number of servers for fault tolerance.
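The placement and retention ideas above can be sketched as follows. Both functions are simplified illustrations with hypothetical names: the round-robin placement is not Kafka's actual assignment algorithm, and retention is shown per record rather than per on-disk file.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Return {partition: [broker, ...]} using simple round-robin placement."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    n = len(brokers)
    return {
        p: [brokers[(p + i) % n] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def expire(records, now, ttl_seconds):
    """Keep records younger than the TTL, whether or not they were consumed."""
    return [(ts, msg) for ts, msg in records if now - ts < ttl_seconds]


layout = assign_replicas(num_partitions=4, brokers=["b1", "b2", "b3"],
                         replication_factor=2)
records = [(0, "old event"), (90, "recent event")]  # (timestamp, message)
kept = expire(records, now=100, ttl_seconds=60)
```

With three brokers and a replication factor of 2, every partition in this sketch survives the loss of any single broker.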
18. High Level Overview
[Diagram]
• Producers: Daemons, Web, Logstash reader, …
• Broker 1 and Broker 2, each shown with:
– Family Tree changes Topic (part 1, part 2, … part 32).
– Activity Topic (part 1, part 2, … part 32).
• DRBD replica of Broker 1 and DRBD replica of Broker 2.
• Consumers: Indexing, RecordMatching, Face recog., …
22. Kafka @MyHeritage – Consumers (Indexing)
[Diagram]
• Broker 1, Broker 2.
• EventProcessor (1 per consumer type, reader per partition).
• KafkaWatermark: get/update watermark.
• EventProcessor adds events to the IndexingQueue.
• IndexingWorkers fetch work from the IndexingQueue.
• IndexingWorkers update items in SOLR.
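The flow in this diagram can be approximated with the sketch below. It is not the MyHeritage implementation: the component names are borrowed from the slide, but all data structures, function names, and the offset-based conflict rule are assumptions made for illustration.

```python
import queue
import threading

# Hypothetical stand-ins for the boxes in the diagram.
partition_log = [("person-1", "v1"), ("person-2", "v1"), ("person-1", "v2")]
watermark = {"partition-0": 0}   # KafkaWatermark: next offset to read
indexing_queue = queue.Queue()   # IndexingQueue
solr_index = {}                  # stand-in for SOLR: item_id -> (offset, payload)
index_lock = threading.Lock()


def event_processor(partition: str, batch_size: int = 2) -> int:
    """Read from the watermark, enqueue the events, then advance the watermark."""
    start = watermark[partition]
    batch = partition_log[start:start + batch_size]
    for offset, (item_id, payload) in enumerate(batch, start=start):
        indexing_queue.put((offset, item_id, payload))
    watermark[partition] = start + len(batch)
    return len(batch)


def indexing_worker():
    """Fetch work from the queue and update the item in the index."""
    while True:
        event = indexing_queue.get()
        try:
            if event is None:  # sentinel: shut this worker down
                return
            offset, item_id, payload = event
            with index_lock:
                current = solr_index.get(item_id)
                # Workers run in parallel, so only a newer offset may overwrite.
                if current is None or current[0] < offset:
                    solr_index[item_id] = (offset, payload)
        finally:
            indexing_queue.task_done()


workers = [threading.Thread(target=indexing_worker) for _ in range(3)]
for w in workers:
    w.start()
while event_processor("partition-0"):  # stop when no new events are returned
    pass
indexing_queue.join()                  # wait until all queued work is indexed
for _ in workers:
    indexing_queue.put(None)
for w in workers:
    w.join()
```

In this sketch the watermark advances as soon as events are queued; a real deployment would have to decide whether to advance it only after the workers finish, which trades possible reprocessing for not losing events on a crash.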
24. Summary
Kafka is a very fast and scalable system that is used extensively at MyHeritage, and it is worth considering for the high-scale systems you work with.