Batch Indexing & Near Real Time, keeping things fast
1. Batch Indexing & Near Real Time, keeping things fast
Marc Sturlese
Software engineer @ Trovit
Thursday, 2 May 2013
2. About me...
• Marc Sturlese – @sturlese
• Software engineer @Trovit. R&D focused
• Responsible for search and scalability
3. Agenda
• Who we are
• Batch architecture. Hadoop & Hive
• Near real time architecture. Storm & stuff
• Putting it all together
• Alternatives and Future directions
• Questions
6. Batch Layer
• Hadoop based
• Documents are crunched by a pipeline of MapReduce (MR) jobs
• Hive stores the stats of each phase
7. Batch Layer
Pipeline overview
[Diagram: incoming data and external data enter a Hadoop cluster and pass through the phases Ad Processor, Diff, Matching, Expiration, Deduplication and Indexing. Other labels: t – 1, Hive Stats, Lucene Indexes, Deployment.]
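The pipeline can be pictured as a chain of phases, each consuming the previous phase's output and recording one row of stats. The following is a minimal Python sketch, not the actual MapReduce code: the phase logic, the document fields (`id`, `title`, `expired`) and the in-memory `stats` list (a stand-in for the Hive table) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]
Phase = Callable[[List[Doc]], List[Doc]]


def expiration(docs: List[Doc]) -> List[Doc]:
    """Drop documents flagged as expired (hypothetical 'expired' field)."""
    return [d for d in docs if not d.get("expired", False)]


def deduplication(docs: List[Doc]) -> List[Doc]:
    """Keep the first document seen for each title (hypothetical rule)."""
    seen = set()
    unique = []
    for d in docs:
        if d["title"] not in seen:
            seen.add(d["title"])
            unique.append(d)
    return unique


def run_pipeline(
    docs: List[Doc], phases: List[Tuple[str, Phase]]
) -> Tuple[List[Doc], List[Dict[str, object]]]:
    """Run the phases in order and record input/output counts per phase."""
    stats = []
    for name, phase in phases:
        out = phase(docs)
        stats.append({"phase": name, "docs_in": len(docs), "docs_out": len(out)})
        docs = out
    return docs, stats


incoming = [
    {"id": 1, "title": "Flat in Barcelona", "expired": False},
    {"id": 2, "title": "Flat in Barcelona", "expired": False},
    {"id": 3, "title": "Car in Madrid", "expired": True},
]
result, stats = run_pipeline(
    incoming, [("expiration", expiration), ("deduplication", deduplication)]
)
```

Keeping the per-phase counts next to the data makes it easy to spot a phase that suddenly drops far more documents than usual.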
8. Batch Layer
The good things!
• Index always built from scratch, so it has a small number of big segments
• Multicast deployment sends indexes to all slaves at the same time
• Backups are convenient on HDFS
9. Batch Layer
That was cool but...
• Not even close to real time
• Crunching documents in batch means waiting until everything is processed, which can take a few hours
• We want to show the user fresher results!
10. Near real time Layer
Storm and stuff to the rescue
11. Near real time Layer
Storm properties
• Distributed real time computation system
• Fault tolerance
• Horizontal scalability
• Low latency
• Reliability
12. Near real time Layer
Storm in action
[Diagram: sources (XML feeds, Kafka partitions) feed a Storm topology. Kafka spouts and an XML spout send tuples by shuffle grouping to the Doc Manager bolt, which sends them by fields grouping to the Indexer bolt. The Indexer bolt writes to the Solr prod replicas (slaves).]
13. Near real time Layer
Storm in action
• Spouts just read and send
• Doc Manager Bolt processes and classifies
• Indexer Bolt adds documents to Solr
• Same logic as the batch pipeline, replicated with a different implementation
• Careful not to overload Solr slaves...
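The division of work between spouts and bolts can be sketched with plain Python generators. This is an illustration only, not Storm API code: the record fields (`vertical`, `country`, `title`), the core naming scheme and the `fake_solr` dictionary are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Iterator, List, Tuple

RawAd = Dict[str, str]


def spout(feed: Iterable[RawAd]) -> Iterator[RawAd]:
    """Spouts just read and send: no processing happens here."""
    for raw in feed:
        yield raw


def doc_manager_bolt(stream: Iterable[RawAd]) -> Iterator[Tuple[str, RawAd]]:
    """Process each ad and classify it into a target core (hypothetical naming)."""
    for raw in stream:
        doc = {key: value.strip() for key, value in raw.items()}
        core = f"{doc['vertical']}_{doc['country']}".lower()
        yield core, doc


def indexer_bolt(
    stream: Iterable[Tuple[str, RawAd]], solr: DefaultDict[str, List[RawAd]]
) -> None:
    """Add each classified document to its core (in-memory stand-in for Solr)."""
    for core, doc in stream:
        solr[core].append(doc)


fake_solr: DefaultDict[str, List[RawAd]] = defaultdict(list)
feed = [
    {"id": "a1", "vertical": "Cars", "country": "US", "title": " Ford Focus "},
    {"id": "a2", "vertical": "Jobs", "country": "BR", "title": "Engenheiro"},
]
indexer_bolt(doc_manager_bolt(spout(feed)), fake_solr)
```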
16. Near real time Layer
Storm in action. But...
• Now Solr has to handle both user queries and Storm inserts
• Fields grouping on the Indexer Bolt for politeness
• Small bulks to reduce the number of insert requests
• Committing on many cores on the same host at the same time can be painful
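Two of these ideas are easy to show in code: fields grouping sends every tuple with the same field value to the same task, and bulking turns many single inserts into a few requests. The sketch below is plain Python rather than Storm code; the task count, the bulk size, the use of the core name as the grouping field and the `requests` list (a stand-in for HTTP insert requests) are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_INDEXER_TASKS = 2  # assumed number of Indexer Bolt tasks
BULK_SIZE = 3          # assumed number of docs per insert request


def task_for(core: str, num_tasks: int = NUM_INDEXER_TASKS) -> int:
    """Fields grouping: the same core always maps to the same task."""
    return zlib.crc32(core.encode("utf-8")) % num_tasks


class BulkIndexer:
    """Buffers docs per core and sends them in small bulks."""

    def __init__(self, bulk_size: int = BULK_SIZE) -> None:
        self.bulk_size = bulk_size
        self.buffers: Dict[str, List[dict]] = defaultdict(list)
        self.requests: List[Tuple[str, int]] = []  # (core, docs sent)

    def add(self, core: str, doc: dict) -> None:
        self.buffers[core].append(doc)
        if len(self.buffers[core]) >= self.bulk_size:
            self.flush(core)

    def flush(self, core: str) -> None:
        """Send whatever is buffered for this core as one request."""
        docs = self.buffers.pop(core, [])
        if docs:
            self.requests.append((core, len(docs)))


indexers = [BulkIndexer() for _ in range(NUM_INDEXER_TASKS)]
for i in range(7):
    indexers[task_for("cars_us")].add("cars_us", {"id": i})
```

Because one task owns each core, only that task ever talks to the core, which keeps the number of concurrent writers per core at one.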
17. Near real time Layer
Storm in action - Committing
[Diagram: Indexer Bolts (Cars US, Jobs BR) and a ZooKeeper Locker. Cores spread over Slave 1, Slave 2, ..., Slave N: Real estate UK R1, Cars US R1, Cars US R2, Jobs BR R1, Jobs BR R2, Real estate ES R1.]
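One plausible reading of the diagram is that an Indexer Bolt takes a per-host lock before committing, so that two cores on the same slave never commit at the same time. The sketch below illustrates that idea with an in-process lock per host; it is not a ZooKeeper client, and the class name, host name and core names are hypothetical.

```python
import threading
import time
from collections import defaultdict
from typing import Dict


class HostCommitLocker:
    """In-process stand-in for a ZooKeeper-backed lock: one lock per slave host."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def lock_for(self, host: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(host, threading.Lock())


locker = HostCommitLocker()
active = defaultdict(int)      # commits currently running per host
max_active = defaultdict(int)  # highest concurrency observed per host
counter_guard = threading.Lock()


def commit(host: str, core: str) -> None:
    """Commit one core while holding the lock of the host it lives on."""
    with locker.lock_for(host):
        with counter_guard:
            active[host] += 1
            max_active[host] = max(max_active[host], active[host])
        time.sleep(0.01)  # pretend the commit takes a while
        with counter_guard:
            active[host] -= 1


cores = ("cars_us_r1", "jobs_br_r1", "real_estate_uk_r1")
threads = [threading.Thread(target=commit, args=("slave1", c)) for c in cores]
for t in threads:
    t.start()
for t in threads:
    t.join()
```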
18. Near real time Layer
Storm in action
• Adding documents is now fast
• Keep the number of segments small
• Avoid merges on big segments
• Just add new docs (no deletes or updates)
19. Mixed Architecture
Putting it all together
[Diagram: the near real time topology (sources with XML feeds and Kafka partitions, Storm topology, Solr prod replicas) combined with the MR Pipeline. Other labels: HBase doc info, "Exists?" check, Bulk add, zk.]
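The "Exists?" and "Bulk add" labels suggest that the topology consults the HBase doc info before indexing and only adds documents it has not seen. A minimal sketch of that check, assuming an in-memory `DocInfoStore` in place of HBase and documents identified by a hypothetical `id` field:

```python
from typing import Dict, Iterable, List


class DocInfoStore:
    """In-memory stand-in for the HBase doc info table."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def exists(self, doc_id: str) -> bool:
        return doc_id in self._rows

    def put(self, doc_id: str, doc: dict) -> None:
        self._rows[doc_id] = doc


def select_new_docs(docs: Iterable[dict], store: DocInfoStore) -> List[dict]:
    """Return only unseen docs and remember them for later lookups."""
    new_docs = []
    for doc in docs:
        if not store.exists(doc["id"]):
            store.put(doc["id"], doc)
            new_docs.append(doc)
    return new_docs


store = DocInfoStore()
store.put("a1", {"id": "a1"})  # already known from an earlier run
bulk = select_new_docs([{"id": "a1"}, {"id": "a2"}, {"id": "a2"}], store)
```

Skipping known documents matches the NRT layer's rule of only adding new docs and leaving deletes and updates to the batch layer.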
20. Mixed Architecture
Swapping indexes
• NRT docs might not be contained in the new batch index (they can be fresher than the batch index being built)
• This can lead to inconsistencies...
24. Mixed Architecture
Swapping indexes
[Diagram labels: HBase; XML feed t, t+1, t+2; Pipeline t, t+1; Slave t, t+1; Batch indexer; NRT indexer; NRT t+1, t+2.]
26. Mixed Architecture
Swapping indexes
• NRT-indexed docs must be kept in temporary storage
• Before the next deploy, fetch the docs missing from the new batch index from that storage and add them
• This avoids time jumps
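The replay step amounts to a set difference between what the NRT layer has indexed and what the new batch index contains. A minimal sketch, assuming the temporary storage is a dictionary keyed by document id and that each stored doc carries a hypothetical `indexed_at` timestamp:

```python
from typing import Dict, List, Set


def missing_nrt_docs(nrt_storage: Dict[str, dict], batch_ids: Set[str]) -> List[dict]:
    """Return NRT docs absent from the new batch index, oldest first."""
    missing = [doc for doc_id, doc in nrt_storage.items() if doc_id not in batch_ids]
    return sorted(missing, key=lambda doc: doc["indexed_at"])


# Docs indexed by the NRT layer while the batch index was being built.
nrt_storage = {
    "a1": {"id": "a1", "indexed_at": 100},
    "a2": {"id": "a2", "indexed_at": 205},
    "a3": {"id": "a3", "indexed_at": 190},
}
new_batch_ids = {"a1"}  # ids already present in the freshly built batch index

to_add = missing_nrt_docs(nrt_storage, new_batch_ids)
ids_after_replay = new_batch_ids | {doc["id"] for doc in to_add}
```

After the replay, the index about to be deployed contains everything users could already see, so the swap does not make recent docs disappear.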
27. Mixed Architecture
Storm and Hadoop
• Near real time inserts, low latency
• Hadoop handles deletes and updates. No rush on those
• No merges on big segments, so query response times stay optimal
• Tolerant to human errors
• Temporary loss of accuracy on the NRT layer
28. Alternatives
SolrCloud - Why not?
• Good for the vast majority of use cases
• Oriented to incremental inserts/updates/deletes. You pay for segment merges in exchange for real time
• We need to deploy full indexes fast (faster than rsync or HTTP replication)
• Full deploys are now easier with aliases
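With collection aliases, a full deploy can be finished by building a new collection and then repointing an alias at it through the Collections API `CREATEALIAS` action. The sketch below only builds the request URL and sends nothing; the host, alias name and collection name are made up for illustration.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base: str, alias: str, collection: str) -> str:
    """Build the Collections API URL that points an alias at a collection."""
    params = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{solr_base}/admin/collections?{params}"


# Hypothetical names: queries go to "ads", which is repointed after each build.
url = create_alias_url("http://localhost:8983/solr", "ads", "ads_batch_t1")
```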
29. Future lines
Lucene real time feature
• Lets you see docs in the index before they are committed
• Good, but not a must right now for the use case
• Very easy to integrate into the current architecture
31. Thanks for your attention!
Marc Sturlese
marc@trovit.com
Lucene/Solr Revolution 2013, San Diego, May 1 2013