2. Agenda
1. History and motivation
2. High level architecture
3. Data guarantees
4. Detailed feature overview
5. Quick demo
3. What is Sensei
— search engine and database
— Built on top of Lucene
— Full text search, relevance, faceting
— Distributed, horizontally scalable
4. History
• Technology stack for LinkedIn.com's search,
analytics and homepage
• Open sourced in 2009; first 1.0.0 release in February
2012
• https://github.com/linkedin/sensei
• http://senseidb.com
— sensei-search Google group
— Used by Xiaomi and several other open-source deployments
8. Why yet another Lucene based
search engine?
• Indexing elevates query latency
• Hard to distribute
• Large memory overhead
• Comparatively slow
• SenseiDB: designed for LinkedIn search use
cases and the homepage
10. Motivation
• Indexing/Query isolation
• Structured vs. unstructured data (e.g. fulltext search
support)
• Faceted search
• Business intelligence
11. Sensei’s features
• Fast updates
• Rich query language - BQL
• Fulltext and faceted search
• Distributed and elastic
• Indexing and search customization
• In memory M/R
12. What Sensei doesn’t do
— Transactions and OLTP
— Dynamic shard rebalancing
— Multi tenancy and table joins
— Dynamic schema
13. Volume
— 5-100 million documents per node
— ~300K updates per minute
— Query latency < 100 ms
14. Deployments
— Search engine for SeaS
— Backend for USCP: 400 nodes
— >6 deployments in the team
— Other companies (2 deployments at Xiaomi)
22. Data injection
[Diagram: a gateway (JDBC, Databus, RabbitMQ, or Kafka) delivers events with a version to the Sensei node; the node gets only events with a version bigger than the one it already has.]
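The version handshake on this slide can be sketched in a few lines of Python. This is an illustration only, not Sensei's gateway API: `Event`, `InMemoryGateway`, and `Node` are hypothetical names, and versions are assumed to be monotonically increasing integers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    version: int   # assumption: increases monotonically within a stream
    doc_id: str
    payload: dict


class InMemoryGateway:
    """Hypothetical stand-in for a JDBC/Databus/RabbitMQ/Kafka gateway."""

    def __init__(self, events):
        self._events = sorted(events, key=lambda e: e.version)

    def events_after(self, version):
        # Hand out only events strictly newer than the caller's version.
        return [e for e in self._events if e.version > version]


@dataclass
class Node:
    version: int = 0                          # highest version applied so far
    docs: dict = field(default_factory=dict)  # doc_id -> latest payload

    def catch_up(self, gateway):
        """Apply every event newer than our version; return how many."""
        applied = 0
        for event in gateway.events_after(self.version):
            self.docs[event.doc_id] = event.payload
            self.version = event.version
            applied += 1
        return applied


gateway = InMemoryGateway([
    Event(1, "a", {"color": "red"}),
    Event(2, "b", {"color": "blue"}),
    Event(3, "a", {"color": "black"}),
])
node = Node(version=1)   # e.g. a node restarting with version 1 already indexed
applied = node.catch_up(gateway)
```

Because the node remembers the last version it applied, asking again after a restart or a replay skips everything it has already indexed.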
23. Data guarantees
• Availability - replications
• Eventually consistent across replications
• Write durability - data stream
• Write consistency - data stream
27. Realtime updates
• Updates are visible in under 1 s after insertion
• Handles deletes and updates
• Indexing latency stable as index size grows
• Incremental and balanced segment merges
29. Offline indexing and archive
• Efficient M/R index generation on Hadoop over
ETL'd data
• Bootstrap from HDFS
30. Query Engine - Bobo
• Query planning/optimization
• Access to both inverted and forward data structures
• High performance faceting
• Dynamic sorting
• Dynamic relevance support
• Map/Reduce analytics engine
31. Bobo (cont.)
[Diagram: each Lucene segment is paired with a custom (forward) index; the per-segment outputs are combined into a single result.]
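Why a forward index helps faceting can be shown with a toy segment. The sketch below is not Bobo's implementation; the data, the single-valued `color` field, and the `facet_counts` helper are assumptions made for illustration. The inverted index answers "which documents match?", and the forward index answers "what is this document's field value?" with one array lookup per hit.

```python
from collections import Counter

# Toy segment with doc IDs 0..4.
# Forward index: position = doc ID, value = that doc's "color" field.
color_forward = ["red", "blue", "red", "black", "blue"]

# Inverted index: term -> sorted list of doc IDs containing the term.
inverted = {
    "sedan": [0, 1, 2, 3],
    "van": [2, 4],
}


def facet_counts(term, forward):
    """Count facet values over the documents that match `term`."""
    hits = inverted.get(term, [])
    # One array lookup per hit; no stored document needs to be loaded.
    return Counter(forward[doc_id] for doc_id in hits)


sedan_colors = facet_counts("sedan", color_forward)
van_colors = facet_counts("van", color_forward)
```

In a multi-segment index, the same counting would run once per segment and the per-segment counters would be summed into the final result.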
32. Sensei API - BQL
SELECT color, category, year, makemodel
FROM cars
WHERE NOT MATCH(color, category)
AGAINST("*van")
GROUP BY category TOP 1
LIMIT 1000
33. Dynamic relevance
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model
(favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor, String favoriteTag)
BEGIN
float boost = 1.0;
if (tags.contains(favoriteTag))
boost += 0.5;
if (color.equals(favoriteColor))
boost += 1.2;
return _INNER_SCORE * boost;
END
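To see what the model does to a ranking, here is a Python port of its body applied to three toy documents. It assumes the color comparison is against the `favoriteColor` parameter, and the documents and inner scores are made up for illustration; it does not reproduce how Sensei compiles or runs relevance models.

```python
def my_model(inner_score, doc, favorite_color, favorite_tag):
    """Python port of the relevance model body above."""
    boost = 1.0
    if favorite_tag in doc["tags"]:
        boost += 0.5
    if doc["color"] == favorite_color:
        boost += 1.2
    return inner_score * boost


# Hypothetical documents, each with the score the base query gave it.
docs = [
    {"id": "plain", "color": "red", "tags": ["cheap"], "inner": 2.0},
    {"id": "cool", "color": "red", "tags": ["cool"], "inner": 2.0},
    {"id": "black_cool", "color": "black", "tags": ["cool"], "inner": 2.0},
]

scores = {
    d["id"]: my_model(d["inner"], d, favorite_color="black", favorite_tag="cool")
    for d in docs
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With equal inner scores, the document matching both the favorite color and the favorite tag ranks first (boost of roughly 2.7), the tag-only match second (1.5), and the unboosted document last (1.0).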
41. Sensei in memory M/R
— select distinctCount(memberId), sum(clickCount)
where geo = 'US/CA/SF' group by seniority, age
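The query above maps naturally onto a map step per segment and a reduce step that merges partial results. The following Python sketch shows that shape under stated assumptions: the segment data is invented, `map_segment` and `reduce_partials` are hypothetical names, and the distinct count is computed exactly with sets, which may differ from what Sensei does internally.

```python
from collections import defaultdict

# Two toy segments; each document is a flat dict of column values.
segments = [
    [
        {"memberId": 1, "clickCount": 3, "geo": "US/CA/SF", "seniority": "senior", "age": 30},
        {"memberId": 2, "clickCount": 1, "geo": "US/NY/NYC", "seniority": "senior", "age": 30},
    ],
    [
        {"memberId": 1, "clickCount": 2, "geo": "US/CA/SF", "seniority": "senior", "age": 30},
        {"memberId": 3, "clickCount": 5, "geo": "US/CA/SF", "seniority": "entry", "age": 25},
    ],
]


def map_segment(docs, geo):
    """Map: per-segment partial aggregates keyed by (seniority, age)."""
    partial = defaultdict(lambda: (set(), 0))
    for d in docs:
        if d["geo"] != geo:
            continue  # the WHERE clause
        key = (d["seniority"], d["age"])
        members, clicks = partial[key]
        partial[key] = (members | {d["memberId"]}, clicks + d["clickCount"])
    return partial


def reduce_partials(partials):
    """Reduce: merge member sets and click sums, then finalize the counts."""
    merged = defaultdict(lambda: (set(), 0))
    for partial in partials:
        for key, (members, clicks) in partial.items():
            seen, total = merged[key]
            merged[key] = (seen | members, total + clicks)
    # distinctCount is the size of the merged set; sum is the running total.
    return {key: (len(seen), total) for key, (seen, total) in merged.items()}


result = reduce_partials(map_segment(s, "US/CA/SF") for s in segments)
```

Member 1 appears in both segments, so the reduce step must merge sets rather than add per-segment distinct counts; adding the counts would report 2 instead of 1 for the ("senior", 30) group.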
42. Roadmap
• Just finished
o Sensei aggregation functions
o Map/Reduce analytics engine
• Plan
o Goshawk – for business intelligence (WVMP v2, LI
Impressions)
o Zoie redesign to support fixed-length in-memory
segments