Sensei

     Volodymyr Zhabiuk
Agenda
1.  History and motivation
2.  High level architecture
3.  Data guarantees
4.  Features detailed overview
5.  Quick demo
What is Sensei
—  search engine and database
—  Built on top of Lucene
—  Full text search, relevance, faceting
—  Distributed, horizontally scalable
History
•    Technology stack for LinkedIn.com's search,
     analytics and homepage

•    Open sourced in 2009, first 1.0.0 release February
     2012

•    https://github.com/linkedin/sensei

•    http://senseidb.com

—  sensei-search Google group
—  Used by Xiaomi, several other OS deployments
Why yet another Lucene-based search engine?
•    Indexing elevates query latency
•    Hard to distribute

•    Large memory overhead
•    Comparatively slow

SenseiDB
•    Designed for LinkedIn search use cases and the Homepage
Motivation
•    Indexing/Query isolation

•    Structured vs. unstructured data (e.g. fulltext search
     support)

•    Faceted search

•    Business intelligence
Sensei’s features
•    Fast updates

•    Rich query language - BQL

•    Fulltext and faceted search

•    Distributed and elastic

•    Indexing and search customization

•    In memory M/R
What Sensei doesn’t do
—  Transactions and OLTP
—  Dynamic shard rebalancing
—  Multi tenancy and table joins
—  Dynamic schema
Volume

•    5-100 million documents per node
•    ~300K updates per minute
•    Query latency < 100 ms
Deployments
•    Search engine for SeaS
•    Backend for USCP: 400 nodes
•    >6 deployments in the team
•    Other companies (2 deployments at Xiaomi)
Sensei’s technologies
•    Stack, top to bottom: Sensei, Bobo, Zoie, Lucene
•    Alongside the stack: Norbert, Zookeeper
Vocabulary
Node   Shard/Partition   Replica
High level architecture
Data injection
•    Sources: JDBC, Databus, RabbitMQ, Kafka
•    Gateway: gets events with a version greater than the existing one
•    Sensei node: receives each event with its version
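The version check on this slide can be sketched roughly as follows. This is a minimal Python illustration, not Sensei's gateway API: `Event`, `InMemoryStream`, and `Node` are hypothetical names, and versions are assumed to be plain increasing integers.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    version: int   # assumed: increasing position of the event in the stream
    doc: dict      # document payload to index


@dataclass
class InMemoryStream:
    """Hypothetical stand-in for a JDBC/Databus/RabbitMQ/Kafka source."""
    events: list

    def events_after(self, version):
        # Only events newer than what the node already holds are returned.
        return [e for e in self.events if e.version > version]


@dataclass
class Node:
    version: int = 0                           # highest version applied so far
    index: dict = field(default_factory=dict)  # uid -> latest document

    def consume(self, stream):
        """Apply all newer events in version order; return how many were applied."""
        pending = sorted(stream.events_after(self.version), key=lambda e: e.version)
        for event in pending:
            self.index[event.doc["id"]] = event.doc
            self.version = event.version
        return len(pending)


stream = InMemoryStream([
    Event(1, {"id": 1, "color": "red"}),
    Event(2, {"id": 2, "color": "blue"}),
    Event(3, {"id": 1, "color": "black"}),
])
node = Node()
first_pass = node.consume(stream)    # applies versions 1 to 3
second_pass = node.consume(stream)   # nothing newer, so a replay changes nothing
```

Because the node remembers the last version it applied, replaying the stream after a restart only picks up events it has not seen yet.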
Data guarantees
•    Availability - replication

•    Eventually consistent across replicas

•    Write durability - data stream

•    Write consistency - data stream
Configuration
•    schema.xml
     o    Indexed fields
     o    Forward index customization
•    sensei.properties
     o    Ports, plugins, Zookeeper URLs, etc.
Features
Lucene realtime extension
•    Disk index
Realtime updates
•    Updates become searchable within 1 s of insertion

•    Handles deletes and updates

•    Indexing latency stable as index size grows

•    Incremental and balanced segment merges
Hourglass (time series)
Offline indexing and archive
•    Efficient M/R index generation on Hadoop over
     ETL'd data

•    Bootstrap from HDFS
Query Engine - Bobo
•    Query planning/optimization

•    Access to both inverted and forward data structures

•    High performance faceting

•    Dynamic sorting

•    Dynamic relevance support

•  Map/Reduce analytics engine
Bobo (cont.)
•    Each Lucene segment is paired with a custom (forward) index
•    Per-segment output is merged into a single result
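One way to picture the pairing of inverted and forward structures is the sketch below. It is a simplified Python illustration, not Bobo's API: `Segment`, `facet_counts`, and `search_facets` are hypothetical names, and each segment keeps a single facet field.

```python
from collections import Counter


class Segment:
    """Hypothetical segment: an inverted index plus a forward index for one field."""

    def __init__(self, docs, facet_field):
        self.inverted = {}   # term -> list of local doc ids
        self.forward = []    # local doc id -> facet value
        for doc_id, doc in enumerate(docs):
            self.forward.append(doc[facet_field])
            for term in set(doc["text"].split()):
                self.inverted.setdefault(term, []).append(doc_id)

    def facet_counts(self, term):
        # The inverted index finds the hits; the forward index gives
        # each hit's facet value without loading the stored document.
        return Counter(self.forward[d] for d in self.inverted.get(term, []))


def search_facets(segments, term):
    """Merge the per-segment counts into one result."""
    total = Counter()
    for segment in segments:
        total.update(segment.facet_counts(term))
    return total


segments = [
    Segment([{"text": "fast red car", "color": "red"},
             {"text": "slow blue van", "color": "blue"}], "color"),
    Segment([{"text": "fast blue car", "color": "blue"},
             {"text": "fast red van", "color": "red"}], "color"),
]
counts = search_facets(segments, "fast")
```

Here `counts` holds two hits for `red` and one for `blue`; the forward index is what keeps the counting step to an array lookup per hit.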
Sensei API - BQL

 SELECT color, category, year, makemodel
 FROM cars
 WHERE NOT MATCH(color, category)
 AGAINST("*van")
 GROUP BY category TOP 1
 LIMIT 1000
Dynamic relevance
 SELECT *
 FROM cars
 WHERE price > 2000.00
 USING RELEVANCE MODEL my_model
 (favoriteColor:"black", favoriteTag:"cool")
   DEFINED AS (String favoriteColor, String favoriteTag)
    BEGIN
     float boost = 1.0;
     if (tags.contains(favoriteTag))
        boost += 0.5;
     if (color.equals(favoriteColor))
        boost += 1.2;
     return _INNER_SCORE * boost;
    END
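To make the effect of the model concrete, here is the same boost logic as a small Python sketch. It assumes the model compares `color` against the `favoriteColor` parameter and that every hit has an inner score of 1.0; the sample `cars` data is made up for illustration and the code does not go through Sensei.

```python
def my_model(doc, inner_score, favorite_color, favorite_tag):
    """Python rendition of the relevance model body on the slide."""
    boost = 1.0
    if favorite_tag in doc["tags"]:
        boost += 0.5
    if doc["color"] == favorite_color:
        boost += 1.2
    return inner_score * boost


# Hypothetical sample data, not taken from the talk.
cars = [
    {"id": 1, "color": "black", "tags": ["cool", "hybrid"], "price": 7500.0},
    {"id": 2, "color": "red", "tags": ["cool"], "price": 3000.0},
    {"id": 3, "color": "black", "tags": [], "price": 1500.0},
    {"id": 4, "color": "blue", "tags": [], "price": 9000.0},
]

# WHERE price > 2000.00
hits = [car for car in cars if car["price"] > 2000.00]

# Assumed inner score of 1.0 for every hit, so only the boost decides the order.
ranked = sorted(hits,
                key=lambda car: my_model(car, 1.0, "black", "cool"),
                reverse=True)
ranked_ids = [car["id"] for car in ranked]
```

Car 1 matches both the colour and the tag, car 2 only the tag, and car 4 neither, so they rank in that order; car 3 is filtered out by the price predicate before scoring.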
Partial updates
—  Storing data outside of Lucene
—  High update rate
—  Perfect for counters
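The idea of keeping frequently changing values outside the text index can be sketched as follows. This is a hypothetical Python illustration, not Sensei's storage layer: `Store`, `index_writes`, and the field name `clickCount` in the usage are assumptions made for the example.

```python
class Store:
    """Hypothetical store: indexed documents plus counters kept on the side."""

    def __init__(self):
        self.indexed_docs = {}   # uid -> document; stands in for the Lucene index
        self.index_writes = 0    # how many times a document was (re)indexed
        self.counters = {}       # (uid, field) -> int, kept outside the index

    def index(self, uid, doc):
        self.indexed_docs[uid] = doc
        self.index_writes += 1

    def increment(self, uid, field, delta=1):
        # A partial update touches only the counter, never the indexed document.
        key = (uid, field)
        self.counters[key] = self.counters.get(key, 0) + delta

    def fetch(self, uid):
        # Reads join the indexed document with its side-stored counters.
        doc = dict(self.indexed_docs[uid])
        for (doc_uid, field), value in self.counters.items():
            if doc_uid == uid:
                doc[field] = value
        return doc


store = Store()
store.index(42, {"title": "Sensei talk"})
for _ in range(1000):
    store.increment(42, "clickCount")
doc = store.fetch(42)
```

After a thousand increments the document has still been indexed only once, which is why this layout suits counters with a high update rate.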
Sensei in memory M/R
•    Topology: a Broker in front of Node1 and Node2, which hold the Lucene segments
•    Map, over the Lucene segments: map(IntArray docs, FieldAccessor, FacetCountAccessor)
•    Combine, on each node: List<MapResult> combine(List<MapResult>)
•    Reduce, on the broker: JSONObject reduce(List<MapResult>)
Sensei in memory M/R

•    select distinctCount(memberId), sum(clickCount)
     where geo = 'US/CA/SF' group by seniority, age
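The query above can be walked through the map, combine, and reduce stages with a small sketch. This is plain Python under stated assumptions, not Sensei's Java interfaces: `merge`, `map_segment`, `combine`, `reduce_results`, and the sample documents are all hypothetical, and a node is modelled as a list of segments.

```python
def merge(partials):
    """Merge partial results keyed by (seniority, age)."""
    merged = {}
    for partial in partials:
        for key, (members, clicks) in partial.items():
            seen, total = merged.get(key, (set(), 0))
            merged[key] = (seen | members, total + clicks)
    return merged


def map_segment(docs):
    """Map: filter and group the documents of one segment."""
    rows = ({(d["seniority"], d["age"]): ({d["memberId"]}, d["clickCount"])}
            for d in docs if d["geo"] == "US/CA/SF")
    return merge(rows)


def combine(map_results):
    """Combine: shrink a node's per-segment results before they leave the node."""
    return [merge(map_results)]


def reduce_results(map_results):
    """Reduce: the broker turns the merged partials into the final answer."""
    return {f"{seniority}/{age}": {"distinctCount": len(members), "sum": clicks}
            for (seniority, age), (members, clicks) in merge(map_results).items()}


def doc(member_id, clicks, geo, seniority, age):
    return {"memberId": member_id, "clickCount": clicks, "geo": geo,
            "seniority": seniority, "age": age}


# Hypothetical data: node1 holds two segments, node2 holds one.
node1 = [[doc(1, 3, "US/CA/SF", "senior", 30), doc(2, 1, "US/NY/NYC", "senior", 30)],
         [doc(1, 2, "US/CA/SF", "senior", 30), doc(3, 5, "US/CA/SF", "junior", 25)]]
node2 = [[doc(4, 4, "US/CA/SF", "senior", 30), doc(3, 1, "US/CA/SF", "junior", 25)]]

from_nodes = []
for segments in (node1, node2):
    from_nodes.extend(combine([map_segment(s) for s in segments]))
result = reduce_results(from_nodes)
```

Note that the partial results carry sets of member ids rather than counts: member 1 appears in two segments and member 3 on two nodes, so a distinct count can only be computed correctly once all partials have been merged.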
Roadmap
•    Just finished
     o    Sensei aggregation functions
     o    Map/Reduce analytics engine

•    Plan
     o    Goshawk - for business intelligence (WVMP v2, LI
          Impressions)
     o    Zoie redesign to support fixed-length in-memory
          segments
Sensei tweets demo
Questions?
—  SeaS Homepage: http://go/seas
—  Questions: ask_seas@
—  Sensei homepage: senseidb.com
—  Sensei Google group: sensei-search
