SenseiDB

Sensei

Volodymyr Zhabiuk

Agenda
1.  History and motivation
2.  High level architecture
3.  Data guarantees
4.  Features detailed overview
5.  Quick demo

What is Sensei
—  search engine and database
—  Built on top of Lucene
—  Full text search, relevance, faceting
—  Distributed, horizontally scalable

History
•  Technology stack for LinkedIn.com's search, analytics and homepage

•  Open sourced in 2009; first 1.0.0 release in February 2012

•  https://github.com/linkedin/sensei

•  http://senseidb.com

—  sensei-search Google group
—  Used by Xiaomi and several other open-source deployments

Why yet another Lucene-based search engine?
•  Indexing elevates query latency
•  Hard to distribute
•  Large memory overhead
•  Comparatively slow
•  SenseiDB: designed for LinkedIn search use cases and the homepage

Motivation
•  Indexing/Query isolation

•  Structured vs. unstructured data (e.g. full-text search support)

•  Faceted search

•  Business intelligence

Sensei’s features
•  Fast updates

•  Rich query language - BQL

•  Fulltext and faceted search

•  Distributed and elastic

•  Indexing and search customization

•  In memory M/R

What Sensei doesn’t do
—  Transactions and OLTP
—  Dynamic shard rebalancing
—  Multi tenancy and table joins
—  Dynamic schema

Volume

—  5-100 million documents per node
—  ~300K updates per minute
—  Query latency < 100 ms

Deployments
—  Search engine for SeaS
—  Backend for USCP: 400 nodes
—  >6 deployments in the team
—  Other companies (2 deployments at Xiaomi)

Sensei’s technologies

[Stack diagram, built up layer by layer]
—  Sensei (top)
—  Bobo and Norbert
—  Zoie and Zookeeper
—  Lucene (base)
Vocabulary
—  Node
—  Shard/Partition
—  Replica

Data injection

—  Data sources: JDBC, Databus, RabbitMQ, Kafka
—  Gateway: gets events with a version bigger than the existing one
—  Sensei node: receives each event together with its version
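
The version check on this slide can be sketched in a few lines of Java. This is only an illustration under assumptions: the names `Event`, `Node`, `consume` and `indexedVersion` are hypothetical and not part of Sensei's API, and versions are modelled as plain increasing `long` values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VersionedGatewaySketch {
    /** Hypothetical event: a payload tagged with an increasing version. */
    static final class Event {
        final long version;
        final String payload;
        Event(long version, String payload) {
            this.version = version;
            this.payload = payload;
        }
    }

    /** Hypothetical node state: remembers the highest version it has indexed. */
    static final class Node {
        private long indexedVersion;
        private final List<String> indexed = new ArrayList<>();

        Node(long indexedVersion) {
            this.indexedVersion = indexedVersion;
        }

        /** Applies only events whose version is bigger than the existing one. */
        int consume(List<Event> stream) {
            int applied = 0;
            for (Event e : stream) {
                if (e.version <= indexedVersion) {
                    continue; // already indexed; safe to skip when the stream is replayed
                }
                indexed.add(e.payload);
                indexedVersion = e.version;
                applied++;
            }
            return applied;
        }

        long indexedVersion() {
            return indexedVersion;
        }
    }

    public static void main(String[] args) {
        Node node = new Node(2L); // the node already holds everything up to version 2
        List<Event> stream = Arrays.asList(
            new Event(1L, "a"), new Event(2L, "b"),
            new Event(3L, "c"), new Event(4L, "d"));
        System.out.println("applied: " + node.consume(stream));
        System.out.println("version: " + node.indexedVersion());
    }
}
```

Because a node only asks for events newer than its own version, replaying the same stream after a restart applies nothing twice.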

Data guarantees
•  Availability - replication

•  Eventually consistent across replicas

•  Write durability - data stream

•  Write consistency - data stream

Configuration
—  schema.xml
   —  indexed fields
   —  forward index customization
—  sensei.properties
   —  ports, plugins, Zookeeper URLs, etc.

Lucene realtime extension

Disk Index

Realtime updates
•  Updates are visible in under 1 s after insertion

•  Handles deletes and updates

•  Indexing latency stable as index size grows

•  Incremental and balanced segment merges

Offline indexing and archive
•  Efficient M/R index generation on Hadoop over ETL'd data

•  Bootstrap from HDFS

Query Engine - Bobo
•  Query planning/optimization

•  Access to both inverted and forward data structures

•  High performance faceting

•  Dynamic sorting

•  Dynamic relevance support

•  Map/Reduce analytics engine

Bobo (cont.)

[Diagram: three Lucene segments, each paired with its own custom (forward) index, feeding a single Result]

Sensei API - BQL

SELECT color, category, year, makemodel
FROM cars
WHERE NOT MATCH(color, category)
AGAINST("*van")
GROUP BY category TOP 1
LIMIT 1000

Dynamic relevance
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model
  (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor, String favoriteTag)
BEGIN
  float boost = 1.0;
  if (tags.contains(favoriteTag))
    boost += 0.5;
  if (color.equals(favoriteColor))
    boost += 1.2;
  return _INNER_SCORE * boost;
END

Partial updates
—  Storing data outside of Lucene
—  High update rate
—  Perfect for counters

Sensei in memory M/R

[Diagram: a Broker in front of Node1 and Node2, each node holding Lucene segments]

—  map(IntArray docs, FieldAccessor, FacetCountAccessor): runs against the Lucene segments on each node
—  List<MapResult> combine(List<MapResult>): merges map results on the nodes
—  JSONObject reduce(List<MapResult>): produces the final result on the Broker

—  select distinctCount(memberId), sum(clickCount) where geo = 'US/CA/SF' group by seniority, age
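
The map, combine and reduce steps can be illustrated with a small self-contained Java sketch. It is not Sensei's implementation: plain `Map<String, Long>` values stand in for `MapResult` and `JSONObject`, the `Doc` class is hypothetical, and only `sum(clickCount)` grouped by `seniority` is modelled (the `distinctCount` and the `geo` filter are left out).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InMemoryMapReduceSketch {
    /** Hypothetical document; Sensei itself reads fields through accessors. */
    static final class Doc {
        final String seniority;
        final long clickCount;
        Doc(String seniority, long clickCount) {
            this.seniority = seniority;
            this.clickCount = clickCount;
        }
    }

    /** map: one partial aggregate per segment (sum of clickCount by seniority). */
    static Map<String, Long> map(List<Doc> segment) {
        Map<String, Long> partial = new HashMap<>();
        for (Doc d : segment) {
            partial.merge(d.seniority, d.clickCount, Long::sum);
        }
        return partial;
    }

    /** combine: merges partials; output has the same shape as input, so it can be repeated. */
    static List<Map<String, Long>> combine(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        List<Map<String, Long>> out = new ArrayList<>();
        out.add(merged);
        return out;
    }

    /** reduce: the broker folds all node results into one sorted final result. */
    static Map<String, Long> reduce(List<Map<String, Long>> nodeResults) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> p : nodeResults) {
            p.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Node1 holds two segments, Node2 holds one.
        List<Doc> segA = Arrays.asList(new Doc("senior", 3), new Doc("junior", 1));
        List<Doc> segB = Arrays.asList(new Doc("senior", 2));
        List<Doc> segC = Arrays.asList(new Doc("junior", 4), new Doc("senior", 5));

        List<Map<String, Long>> node1 = combine(Arrays.asList(map(segA), map(segB)));
        List<Map<String, Long>> node2 = combine(Arrays.asList(map(segC)));

        List<Map<String, Long>> atBroker = new ArrayList<>(node1);
        atBroker.addAll(node2);
        System.out.println(reduce(atBroker)); // prints {junior=5, senior=10}
    }
}
```

Combining on the nodes means only one small partial result per node travels to the broker, rather than one per segment.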

Roadmap
•  Just finished
o  Sensei aggregation functions
o  Map/Reduce analytics engine

•  Plan
o  Goshawk – for business intelligence (WVMP v2, LI Impressions)
o  Zoie redesign to support fixed-length in-memory segments

Questions?
—  SeaS Homepage: http://go/seas
—  Questions: ask_seas@
—  Sensei homepage: senseidb.com
—  Sensei Google group: sensei-search
