Hippo meetup: enterprise search with Solr and elasticsearch
15th January 2013 – Hippo meetup

Luca Cavanna
Software developer & Search consultant at Trifork Amsterdam

luca.cavanna@trifork.nl   - @lucacavanna
Trifork (aka Jteam/Dutchworks/Orange11)




     Focus areas:
     –   Big data & Search
     –   Mobile
     –   Custom solutions
     –   Knowledge (GOTO Amsterdam)


 ●   Hippo partner


 ●   Hippo related search projects:
     –   uva.nl
     –   working on rijksoverheid.nl
Agenda



●   Search introduction
    –   Lucene foundation
    –   Why do we need Solr or elasticsearch?
●   Scaling with Solr
●   Elasticsearch distributed nature
●   Elasticsearch features
Apache Lucene



●   High-performance, full-featured text search engine
    library written entirely in Java

●   It indexes documents as collections of fields

●   A field is a string-based key-value pair

●   What data structure does it use under the hood?
Inverted index

Documents:

1   The old night keeper keeps the keep in the town
2   In the big old house in the big old gown.
3   The house in the town had the big old keep
4   Where the old night keeper never did sleep.
5   The night keeper keeps the keep in the night
6   And keeps in the dark and sleeps in the light.

Index (freq = number of documents containing the term):

term      freq   posting list
and       1      6
big       2      2 3
dark      1      6
did       1      4
gown      1      2
had       1      3
house     2      2 3
in        5      1 2 3 5 6
keep      3      1 3 5
keeper    3      1 4 5
keeps     3      1 5 6
light     1      6
never     1      4
night     3      1 4 5
old       4      1 2 3 4
sleep     1      4
sleeps    1      6
the       6      1 2 3 4 5 6
town      2      1 3
where     1      4
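The table above can be reproduced with a few lines of plain Java. This is only a sketch of the data structure, not Lucene code: the class name `InvertedIndexSketch` and the crude analysis (lowercase, split on non-letters) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical class, not part of Lucene).
final class InvertedIndexSketch {

    static final List<String> DOCS = List.of(
        "The old night keeper keeps the keep in the town",
        "In the big old house in the big old gown.",
        "The house in the town had the big old keep",
        "Where the old night keeper never did sleep.",
        "The night keeper keeps the keep in the night",
        "And keeps in the dark and sleeps in the light.");

    // Maps each term to the ordered list of 1-based document numbers containing it.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1;
            // Assumed analysis chain: lowercase, then split on anything that is not a letter.
            for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                List<Integer> postings = index.computeIfAbsent(term, t -> new ArrayList<>());
                // Documents are visited in order, so a repeated docId can only be the last entry.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        build(DOCS).forEach((term, postings) ->
            System.out.println(term + "  freq=" + postings.size() + "  postings=" + postings));
    }
}
```

A term lookup is then a single map access, which is why searching an inverted index does not require scanning the documents.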
Inverted index



●   Indexing
    –   Text analysis
         ●   Tokenization, lowercasing and more


●   The inverted index can contain more data
    –   Term offsets and more


●   The inverted index itself doesn't contain the text for
    displaying the search results
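As a rough illustration of text analysis and of the extra data an index can carry, the sketch below tokenizes and lowercases a text while recording each term's position and character offsets. `AnalysisSketch` and `Token` are hypothetical names, and the letters-only tokenizer is an assumption; Lucene's analyzers are considerably richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy analysis chain (hypothetical, not Lucene's Analyzer API).
final class AnalysisSketch {

    // One analyzed token: the indexed term plus where it came from in the original text.
    static final class Token {
        final String term;
        final int position;     // 0-based token position
        final int startOffset;  // inclusive character offset
        final int endOffset;    // exclusive character offset

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + " pos=" + position + " [" + startOffset + "," + endOffset + ")";
        }
    }

    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Tokenize on runs of letters and lowercase each token.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        int position = 0;
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            tokens.add(new Token(term, position++, matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        analyze("The old night keeper").forEach(System.out::println);
    }
}
```

Offsets like these point back into the original text, which is the kind of information a highlighter needs; the text itself still has to be stored separately.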
Indexing



●   Lucene writes indexes as segments
●   Segments are not modifiable: Write-Once
●   Each segment is a searchable mini index

●   Each segment contains
    –   Inverted index
    –   Stored fields
    –   ...and more
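The write-once idea can be sketched in plain Java: every flush produces a new frozen segment, and a search simply asks each segment in turn. `SegmentSketch` is a hypothetical toy that ignores merging, deletes and stored fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a segmented index (hypothetical, not Lucene code).
final class SegmentSketch {

    // A write-once segment: its postings are frozen when it is created.
    static final class Segment {
        private final Map<String, List<Integer>> postings;

        Segment(int firstDocId, List<String> docs) {
            Map<String, List<Integer>> building = new TreeMap<>();
            for (int i = 0; i < docs.size(); i++) {
                int docId = firstDocId + i;
                for (String term : docs.get(i).toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (term.isEmpty()) continue;
                    List<Integer> list = building.computeIfAbsent(term, t -> new ArrayList<>());
                    if (!list.contains(docId)) list.add(docId);
                }
            }
            // Freeze the structure so nothing can modify the segment afterwards.
            building.replaceAll((term, list) -> Collections.unmodifiableList(list));
            this.postings = Collections.unmodifiableMap(building);
        }

        List<Integer> search(String term) {
            return postings.getOrDefault(term, List.of());
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private int nextDocId = 0;

    // Writing a batch never touches existing segments; it only appends a new one.
    void flush(List<String> batch) {
        segments.add(new Segment(nextDocId, batch));
        nextDocId += batch.size();
    }

    // Each segment is a searchable mini index; the results are concatenated.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Segment segment : segments) hits.addAll(segment.search(term));
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentSketch index = new SegmentSketch();
        index.flush(List.of("big old house", "old town"));
        index.flush(List.of("the old keep"));
        System.out.println(index.search("old") + " found in " + index.segmentCount() + " segments");
    }
}
```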
Indexing: the commit operation



●   Documents are searchable only after a commit!

●   Commit also gives durability

●   The most expensive operation in Lucene!!!
Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)



 ●   With the Lucene near-real-time API you don't need a
     commit to make new documents searchable

 ●   Less expensive than commit

 ●   Doesn't guarantee durability though

 ●   Exposed as soft commit in Solr 4.0
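The trade-off between a commit and a near-real-time reopen can be modelled with three lists. This is a deliberately simplified toy (the class `CommitSketch` and its method names are hypothetical, and no real I/O happens); it only mirrors the visibility and durability rules stated on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of visibility vs. durability (hypothetical, not the Lucene or Solr API).
final class CommitSketch {
    private final List<String> buffered = new ArrayList<>();   // added, not yet visible
    private final List<String> searchable = new ArrayList<>(); // visible to searches
    private final List<String> durable = new ArrayList<>();    // would survive a crash

    void add(String doc) {
        buffered.add(doc);
    }

    // Near-real-time reopen / soft commit: makes documents visible, nothing is made durable.
    void softCommit() {
        searchable.addAll(buffered);
        buffered.clear();
    }

    // Commit: documents become visible and durable (the expensive part in a real index).
    void commit() {
        softCommit();
        durable.clear();
        durable.addAll(searchable);
    }

    // Simulated crash and restart: only committed documents come back.
    void crashAndRestart() {
        buffered.clear();
        searchable.clear();
        searchable.addAll(durable);
    }

    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }

    public static void main(String[] args) {
        CommitSketch index = new CommitSketch();
        index.add("a");
        index.commit();
        index.add("b");
        index.softCommit();
        System.out.println("b searchable before crash: " + index.isSearchable("b"));
        index.crashAndRestart();
        System.out.println("b searchable after crash: " + index.isSearchable("b"));
    }
}
```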
Lucene code example – indexing data


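 import java.io.File;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.FieldType;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 // Open (or create) the index in the "data" directory
 IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
             new StandardAnalyzer(Version.LUCENE_40));
 Directory directory = FSDirectory.open(new File("data"));
 IndexWriter writer = new IndexWriter(directory, config);

 Document document = new Document();

 // id: indexed as a single token (not analyzed) and stored
 FieldType idFieldType = new FieldType();
 idFieldType.setIndexed(true);
 idFieldType.setStored(true);
 idFieldType.setTokenized(false);
 document.add(new Field("id","id-1", idFieldType));

 // title: analyzed, indexed and stored
 FieldType titleFieldType = new FieldType();
 titleFieldType.setIndexed(true);
 titleFieldType.setStored(true);
 document.add(new Field("title","This is the title", titleFieldType));

 // description: analyzed and indexed but not stored (searchable, not returned)
 FieldType descriptionFieldType = new FieldType();
 descriptionFieldType.setIndexed(true);
 document.add(new Field("description","This is the description", descriptionFieldType));

 writer.addDocument(document);

 // close() commits the pending changes and releases the index
 writer.close();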
Lucene code example – querying and showing results



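 import java.io.File;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexableField;
 import org.apache.lucene.queryparser.classic.QueryParser;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 // queryAsString is the user's raw query; "title" is the default field
 QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
             new StandardAnalyzer(Version.LUCENE_40));
 Query query = queryParser.parse(queryAsString);
 // Open the index written earlier and ask for the 10 best hits
 Directory directory = FSDirectory.open(new File("data"));
 IndexReader indexReader = DirectoryReader.open(directory);
 IndexSearcher indexSearcher = new IndexSearcher(indexReader);
 TopDocs topDocs = indexSearcher.search(query, 10);
 System.out.println("Total hits: " + topDocs.totalHits);
 // Only stored fields are available on the returned documents
 for (ScoreDoc hit : topDocs.scoreDocs) {
     Document document = indexSearcher.doc(hit.doc);
     for (IndexableField field : document) {
         System.out.println(field.name() + ": " + field.stringValue());
     }
 }
 indexReader.close();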
What's missing?


 ●   A common way to represent documents
 ●   An interface to send documents to (HTTP)
 ●   A way to represent queries
 ●   An interface to send queries to (HTTP)
 ●   Configuration
 ●   Caching
 ●   Distributed infrastructure
 ●   And more...
Enterprise search servers
Scaling – why?


 ‣ The more concurrent searches you run, the slower they
   get

 ‣ Indexing and searching on the same machine will
   substantially harm search performance

    ‣ Segment merging may be a CPU- and I/O-intensive
      operation

    ‣ Disk cache invalidation

 ‣ Failover
Solr replication example
Solr replication (pull approach)


   • Master-slave based solution
   • Single machine for indexing data (master)
   • Multiple machines for querying (slaves)
   • Master is not aware of the slaves
   • Slave is aware of the master
   • Load balancer responsible for balancing the query
     requests

   • What about real-time search? No way!
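The pull approach can be sketched as a slave that periodically compares its index version with the master's and copies the index when they differ. The classes below are a hypothetical toy (no HTTP, no files); the polling interval and the version counter stand in for what the replication handler tracks in a real setup.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of pull replication (hypothetical, not Solr code).
final class PullReplicationSketch {

    // The master only indexes; it keeps no list of slaves.
    static final class Master {
        private final List<String> docs = new ArrayList<>();
        private long version = 0;

        void indexAndCommit(String doc) {
            docs.add(doc);
            version++;
        }

        long version() {
            return version;
        }

        List<String> snapshot() {
            return new ArrayList<>(docs);
        }
    }

    // The slave knows its master and pulls from it.
    static final class Slave {
        private final Master master;
        private List<String> docs = new ArrayList<>();
        private long version = 0;

        Slave(Master master) {
            this.master = master;
        }

        // Would run on a timer in a real deployment; returns true if a new index was fetched.
        boolean poll() {
            if (master.version() == version) return false;
            docs = master.snapshot();
            version = master.version();
            return true;
        }

        boolean contains(String doc) {
            return docs.contains(doc);
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave(master);
        master.indexAndCommit("doc-1");
        System.out.println("visible before poll: " + slave.contains("doc-1"));
        slave.poll();
        System.out.println("visible after poll: " + slave.contains("doc-1"));
    }
}
```

New documents stay invisible on the slave until its next poll, which is why this setup cannot offer real-time search.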
SolrCloud


   • A set of new distributed capabilities in Solr
      • Uses Apache ZooKeeper as a system of record for
        the cluster state, for central configuration, and for
        leader election

   • Whichever server (shard) you send data to:
     • The documents get distributed over the shards
     • A shard can be a leader or a replica and contains a
       subset of the data

   • Scale out easily by adding new Solr nodes
elasticsearch




●   Distributed search engine built on top of Lucene
●   Apache 2 license
●   Written in Java
●   RESTful
●   Created and mainly developed by Shay Banon
●   A company behind it: elasticsearch.com
●   Regular releases
    –   Latest release 0.20.2
elasticsearch



●   Schemaless
    –   Uses defaults and automatic type guessing
    –   Custom mappings may be defined if needed
●   JSON oriented
●   Multi tenancy
    –   Multiple indexes per node, multiple types per index
●   Designed to be distributed from the beginning
●   Almost everything is available as API (including
    configuration)
●   Wide range of administration APIs
elasticsearch distributed terminology



●   Node: a running instance of elasticsearch which belongs
    to a cluster (usually one node per server)
●   Cluster: one or more nodes with the same cluster name
●   Shard: a single Lucene instance. A low-level worker unit
    managed by elasticsearch. An index is split into one or
    more shards.
●   Index: a logical namespace which points to one or more
    shards
    –   Your code won't deal directly with a shard, only with
        an index
    –   But an index is composed of multiple Lucene indexes
        (one per shard)
elasticsearch distributed terminology




●   More shards:
    –   improve indexing performance
    –   increase data distribution (depends on # of nodes)
    –   Watch out: each shard has a cost as well!


●   More replicas:
    –   improve failover
    –   improve querying performance
Transaction Log


   • Indexed docs are fully persistent
      • No need for a Lucene IndexWriter#commit
   • Managed using a transaction log / WAL
   • Full single-node durability (survives kill -9)
   • Utilized when doing hot relocation of shards
   • Periodically “flushed” (calling IndexWriter#commit)
   • Durability and real-time search together!
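The write-ahead idea can be sketched as follows: every operation is appended to a log before it is applied in memory, a flush commits and truncates the log, and recovery replays whatever the log still holds. `TranslogSketch` is a hypothetical in-memory toy; the list named `translog` stands in for a file that would survive a process kill.

```java
import java.util.ArrayList;
import java.util.List;

// Toy write-ahead log (hypothetical, not elasticsearch code).
final class TranslogSketch {
    private final List<String> translog = new ArrayList<>();   // stands in for the on-disk log
    private final List<String> committed = new ArrayList<>();  // stands in for committed segments
    private final List<String> inMemory = new ArrayList<>();   // uncommitted buffer, lost on a crash

    void index(String doc) {
        translog.add(doc);  // 1. record the operation in the log first
        inMemory.add(doc);  // 2. then apply it to the in-memory index
    }

    // Flush: commit the buffer, after which the log is no longer needed.
    void flush() {
        committed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Simulated kill and restart: memory is gone, the log is replayed on top of the last commit.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> allDocs() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(inMemory);
        return all;
    }

    int translogSize() {
        return translog.size();
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("a");
        shard.flush();
        shard.index("b");
        shard.crashAndRecover();
        System.out.println("documents after recovery: " + shard.allDocs());
    }
}
```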
Index - Shards & Replicas



      Node                  Node




      Client:

      curl -XPUT localhost:9200/hippo -d '
      {
         "index" : {
            "number_of_shards" : 2,
            "number_of_replicas" : 1
         }
      }'
Index - Shards & Replicas



         Node                     Node
              Shard 0               Shard 0
             (primary)              (replica)


             Shard 1                 Shard 1
             (replica)              (primary)




         Client:

         curl -XPUT localhost:9200/hippo -d '
         {
            "index" : {
               "number_of_shards" : 2,
               "number_of_replicas" : 1
            }
         }'
Indexing - 1


   • Automatic sharding, push replication
     Node                    Node
         Shard 0               Shard 0
        (primary)              (replica)

        Shard 1                 Shard 1
        (replica)              (primary)



     Client:

     curl -XPUT localhost:9200/hippo/users/1 -d '
     {
        "name" : {
           "first" : "Luca",
           "last" : "Cavanna"
        }
     }'
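Automatic sharding boils down to a deterministic function from document id to shard number. The sketch below uses `String.hashCode()` as a stand-in hash, which is an assumption made for illustration (elasticsearch uses its own hash function); `RoutingSketch` and `shardFor` are hypothetical names.

```java
// Toy document routing (hypothetical; the real hash function differs).
final class RoutingSketch {

    // shard = hash(id) mod number_of_shards, kept non-negative with floorMod.
    static int shardFor(String docId, int numberOfShards) {
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // as in the "hippo" index created earlier
        for (String id : new String[] {"1", "2"}) {
            System.out.println("document " + id + " -> shard " + shardFor(id, numberOfShards));
        }
    }
}
```

Because the result depends on the shard count, changing the number of shards would send existing ids to different shards; this is one way to see why the shard count is chosen when the index is created.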
Indexing - 2



      Node                    Node
          Shard 0               Shard 0
         (primary)              (replica)

         Shard 1                 Shard 1
         (replica)              (primary)




      Client:

      curl -XPUT localhost:9200/hippo/users/2 -d '
      {
         "name" : {
            "first" : "Jeroen",
            "last" : "Reijn"
         }
      }'
Search - 1


   • Scatter / Gather search
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                  Shard 1                 Shard 1
                  (replica)              (primary)




                              Client


curl -XGET 'localhost:9200/hippo/_search?q=luca'
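The gather half of scatter/gather can be sketched as a merge of per-shard result lists: each shard returns its own best hits, and the node coordinating the request keeps the global top k. `ScatterGatherSketch` and `Hit` are hypothetical names, and the scores are made-up inputs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy gather step (hypothetical, not elasticsearch code).
final class ScatterGatherSketch {

    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merge the local top hits of every shard and keep the k highest-scoring ones.
    static List<Hit> gather(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) all.addAll(shardHits);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shard0 = List.of(new Hit("a", 2.0), new Hit("b", 0.5));
        List<Hit> shard1 = List.of(new Hit("c", 1.5));
        for (Hit hit : gather(List.of(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```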
Search - 2


   • Automatic balancing between replicas
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                  Shard 1                 Shard 1
                  (replica)              (primary)




                              Client


curl -XGET 'localhost:9200/hippo/_search?q=luca'
Search - 3


   • Automatic failover
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                 Shard 1                  Shard 1
                 (replica)   failure     (primary)




                              Client


 curl -XGET 'localhost:9200/hippo/_search?q=luca'
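Balancing between replicas and failing over can both be expressed as one selection rule: rotate over the copies of a shard and skip the ones on failed nodes. The class below is a hypothetical toy with made-up node names; how elasticsearch actually detects failures is outside its scope.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy copy selection for one shard (hypothetical, not elasticsearch code).
final class ReplicaSelectionSketch {
    private final List<String> copies;                        // nodes holding a copy of the shard
    private final Set<String> failedNodes = new HashSet<>();
    private int next = 0;

    ReplicaSelectionSketch(List<String> copies) {
        this.copies = copies;
    }

    void markFailed(String node) {
        failedNodes.add(node);
    }

    // Round-robin over the copies, skipping nodes known to have failed.
    String pick() {
        for (int attempts = 0; attempts < copies.size(); attempts++) {
            String candidate = copies.get(next);
            next = (next + 1) % copies.size();
            if (!failedNodes.contains(candidate)) return candidate;
        }
        throw new IllegalStateException("no live copy of this shard");
    }

    public static void main(String[] args) {
        ReplicaSelectionSketch shard1 = new ReplicaSelectionSketch(List.of("node1", "node2"));
        System.out.println(shard1.pick() + ", " + shard1.pick());
        shard1.markFailed("node2");
        System.out.println(shard1.pick() + ", " + shard1.pick());
    }
}
```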
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node
        Shard 0           Shard 0
       (primary)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node                Node
        Shard 0           Shard 0
       (primary)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node                Node
        Shard 0           Shard 0            Shard 0
       (primary)          (replica)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Node failure




    Node            Node          Node
        Shard 0                     Shard 0
       (primary)                    (replica)

        Shard 1        Shard 1
        (replica)     (primary)
Node failure - 1


   • Replicas can automatically become primaries


                       Node              Node
                                             Shard 0
                                            (primary)

                           Shard 1
                          (primary)
Node failure - 2


   • Shards are automatically assigned and do “hot”
     recovery


                       Node               Node
                          Shard 0            Shard 0
                          (replica)         (primary)

                           Shard 1           Shard 1
                          (primary)          (replica)
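Replica promotion can be sketched with a small table of shard copies. In this hypothetical toy the first node in each list is treated as the primary, so removing a failed node automatically promotes the next copy; allocating fresh replicas afterwards is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy cluster state (hypothetical, not elasticsearch code).
final class PromotionSketch {
    // shard number -> nodes holding a copy; by convention here, index 0 is the primary.
    private final Map<Integer, List<String>> copies = new TreeMap<>();

    void assign(int shard, String primaryNode, String... replicaNodes) {
        List<String> nodes = new ArrayList<>();
        nodes.add(primaryNode);
        nodes.addAll(Arrays.asList(replicaNodes));
        copies.put(shard, nodes);
    }

    // Drop the failed node's copies; if it held a primary, the next copy moves to index 0.
    void nodeFailed(String node) {
        for (List<String> nodes : copies.values()) nodes.remove(node);
    }

    String primaryOf(int shard) {
        List<String> nodes = copies.get(shard);
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalStateException("shard " + shard + " has no live copy");
        }
        return nodes.get(0);
    }

    public static void main(String[] args) {
        PromotionSketch cluster = new PromotionSketch();
        cluster.assign(0, "node1", "node3"); // shard 0: primary on node1, replica on node3
        cluster.assign(1, "node2", "node1"); // shard 1: primary on node2, replica on node1
        cluster.nodeFailed("node1");
        System.out.println("shard 0 primary: " + cluster.primaryOf(0));
        System.out.println("shard 1 primary: " + cluster.primaryOf(1));
    }
}
```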
Dynamic Replicas



    Node           Node                  Node
        Shard 0      Shard 0
       (primary)     (replica)




    Client:

    curl -XPUT localhost:9200/hippo -d '
    {
       "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 1
       }
    }'
Dynamic Replicas



    Node           Node                  Node
        Shard 0      Shard 0                 Shard 0
       (primary)     (replica)               (replica)




    Client:

    curl -XPUT localhost:9200/hippo/_settings -d '
    {
       "index" : {
          "number_of_replicas" : 2
       }
    }'
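Changing the replica count of an existing index goes through the update-settings endpoint (`/hippo/_settings`). As a sketch, the same request can be built from Java with the JDK's `java.net.http` client (Java 11 or later). The class name `ReplicaSettingsRequest`, the base URL and the `Content-Type` header are assumptions; the example only builds the request and does not contact a cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds the update-settings request (hypothetical helper, not an elasticsearch client API).
final class ReplicaSettingsRequest {

    static HttpRequest build(String baseUrl, String index, int replicas) {
        String body = "{ \"index\" : { \"number_of_replicas\" : " + replicas + " } }";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "hippo", 2);
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running node:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```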
Indexing (Push) - ElasticSearch


 •   Documents added through push requests

 •   Full JSON object representation of documents supported

      •   Embedded objects

 •   First-class parent/child and versioning support

 •   Near-real-time index refreshing available

 •   Real-time get supported

 Example document with an embedded object:

 {
    "name": "Luca Cavanna",
    "location": {
       "city": "Amsterdam",
       "country": "The Netherlands"
    }
 }
Indexing (Pull) - ElasticSearch


 •   Data flows from sources using ‘Rivers’

 •   A river continues to add data as it ‘flows’

 •   Rivers can be added, removed and configured dynamically

 •   Out-of-the-box support for CouchDB and Twitter (implemented by the
     elasticsearch team)

 •   Community implementations for databases, other NoSQL stores and Solr

Searching - ElasticSearch

 •   Search request in Request Body

 •   Powerful and extensible Query DSL

 •   Separation of Query and Filters

 •   Named Filters allowing tracking of which Documents matched which
     Filters

 •   The source of each document is stored by default (_source field)

 •   Catch-all field enabled by default (_all field)

 •   Sorting of results

 •   Highlighting, Faceting, Boosting...and more
Search Example - ElasticSearch

$ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '
{
   "query" : {
     "term" : { "first_name" : "luca" }
   }
}'

Response:

{
   "_shards": {
      "total" : 5,
      "successful" : 5,
      "failed" : 0
   },
   "hits": {
      "total" : 1,
      "hits" : [
         {
            "_index" : "hippo",
            "_type" : "users",
            "_id" : "1",
            "_source" : {
               "first_name" : "Luca",
               "last_name" : "Cavanna"
            }
         }
      ]
   }
}
Thanks


  There is a lot more to say:
    • Query DSL

    • Scripting module (pluggable implementation)

    • Percolator

    • Running it embedded

   Check them out yourself if you are interested!

  Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo ProductsMikio Hirabayashi
 
The Ring programming language version 1.7 book - Part 195 of 196
The Ring programming language version 1.7 book - Part 195 of 196The Ring programming language version 1.7 book - Part 195 of 196
The Ring programming language version 1.7 book - Part 195 of 196Mahmoud Samir Fayed
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectMorningstar Tech Talks
 
Introduction to apache zoo keeper
Introduction to apache zoo keeper Introduction to apache zoo keeper
Introduction to apache zoo keeper Omid Vahdaty
 
MySQL HA with Pacemaker
MySQL HA with  PacemakerMySQL HA with  Pacemaker
MySQL HA with PacemakerKris Buytaert
 
OWASP AppSecCali 2015 - Marshalling Pickles
OWASP AppSecCali 2015 - Marshalling PicklesOWASP AppSecCali 2015 - Marshalling Pickles
OWASP AppSecCali 2015 - Marshalling PicklesChristopher Frohoff
 
Know your platform. 7 things every scala developer should know about jvm
Know your platform. 7 things every scala developer should know about jvmKnow your platform. 7 things every scala developer should know about jvm
Know your platform. 7 things every scala developer should know about jvmPawel Szulc
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11なおき きしだ
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupHolden Karau
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11なおき きしだ
 
Drupal MySQL Cluster
Drupal MySQL ClusterDrupal MySQL Cluster
Drupal MySQL ClusterKris Buytaert
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQLEvan Weaver
 
The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84Mahmoud Samir Fayed
 

Was ist angesagt? (20)

Kyotoproducts
KyotoproductsKyotoproducts
Kyotoproducts
 
Introduction to Tokyo Products
Introduction to Tokyo ProductsIntroduction to Tokyo Products
Introduction to Tokyo Products
 
The Ring programming language version 1.7 book - Part 195 of 196
The Ring programming language version 1.7 book - Part 195 of 196The Ring programming language version 1.7 book - Part 195 of 196
The Ring programming language version 1.7 book - Part 195 of 196
 
Taming Cassandra
Taming CassandraTaming Cassandra
Taming Cassandra
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
 
Introduction to apache zoo keeper
Introduction to apache zoo keeper Introduction to apache zoo keeper
Introduction to apache zoo keeper
 
04 - Qt Data
04 - Qt Data04 - Qt Data
04 - Qt Data
 
MySQL HA with Pacemaker
MySQL HA with  PacemakerMySQL HA with  Pacemaker
MySQL HA with Pacemaker
 
Mongo db roma replication and sharding
Mongo db roma replication and shardingMongo db roma replication and sharding
Mongo db roma replication and sharding
 
OWASP AppSecCali 2015 - Marshalling Pickles
OWASP AppSecCali 2015 - Marshalling PicklesOWASP AppSecCali 2015 - Marshalling Pickles
OWASP AppSecCali 2015 - Marshalling Pickles
 
Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
Vert.X mini-talk
Vert.X mini-talkVert.X mini-talk
Vert.X mini-talk
 
Know your platform. 7 things every scala developer should know about jvm
Know your platform. 7 things every scala developer should know about jvmKnow your platform. 7 things every scala developer should know about jvm
Know your platform. 7 things every scala developer should know about jvm
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
 
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark MeetupBeyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
 
Drupal MySQL Cluster
Drupal MySQL ClusterDrupal MySQL Cluster
Drupal MySQL Cluster
 
Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQL
 
The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84The Ring programming language version 1.2 book - Part 83 of 84
The Ring programming language version 1.2 book - Part 83 of 84
 
Owl2 rl
Owl2 rlOwl2 rl
Owl2 rl
 

Ähnlich wie Hippo meetup: enterprise search with Solr and elasticsearch

DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaTed Dunning
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4thelabdude
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta MapR Technologies
 
No sql & dq2 tracer service
No sql & dq2 tracer serviceNo sql & dq2 tracer service
No sql & dq2 tracer serviceZang Donal
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013Roy Russo
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1medcl
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudthelabdude
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextRafał Kuć
 
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Sneeker Yeh
 
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...Lucidworks
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environmentlucenerevolution
 
Online Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraOnline Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraRobbie Strickland
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 

Ähnlich wie Hippo meetup: enterprise search with Solr and elasticsearch (20)

DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValuesColumn Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
HPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with KattaHPTS talk on micro-sharding with Katta
HPTS talk on micro-sharding with Katta
 
Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4Scaling Through Partitioning and Shard Splitting in Solr 4
Scaling Through Partitioning and Shard Splitting in Solr 4
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta
 
No sql & dq2 tracer service
No sql & dq2 tracer serviceNo sql & dq2 tracer service
No sql & dq2 tracer service
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
 
Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
Scaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - SematextScaling massive elastic search clusters - Rafał Kuć - Sematext
Scaling massive elastic search clusters - Rafał Kuć - Sematext
 
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
 
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...
Distributed Search in Riak - Integrating Search in a NoSQL Database: Presente...
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
 
Roscon2021 Executor
Roscon2021 ExecutorRoscon2021 Executor
Roscon2021 Executor
 
Online Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and CassandraOnline Analytics with Hadoop and Cassandra
Online Analytics with Hadoop and Cassandra
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Hadoop For OpenStack Log Analysis
Hadoop For OpenStack Log AnalysisHadoop For OpenStack Log Analysis
Hadoop For OpenStack Log Analysis
 

Hippo meetup: enterprise search with Solr and elasticsearch

  • 1. 15th January 2013 – Hippo meetup Luca Cavanna Software developer & Search consultant at Trifork Amsterdam luca.cavanna@trifork.nl - @lucacavanna
  • 2. Trifork (aka Jteam/Dutchworks/Orange11) Focus areas: – Big data & Search – Mobile – Custom solutions – Knowledge (GOTO Amsterdam) ● Hippo partner ● Hippo related search projects: – uva.nl – working on rijksoverheid.nl
  • 3. Agenda ● Search introduction – Lucene foundation – Why do we need Solr or elasticsearch? ● Scaling with Solr ● Elasticsearch distributed nature ● Elasticsearch features
  • 4. Apache Lucene ● High-performance, full-featured text search engine library written entirely in Java ● It indexes documents as collections of fields ● A field is a string based key-value pair ● What data structure does it use under the hood?
  • 5. Inverted index
      Documents:
        1  The old night keeper keeps the keep in the town
        2  In the big old house in the big old gown.
        3  The house in the town had the big old keep
        4  Where the old night keeper never did sleep.
        5  The night keeper keeps the keep in the night
        6  And keeps in the dark and sleeps in the light.
      Index (freq = number of documents containing the term):
        term    freq  posting list
        and     1     6
        big     2     2, 3
        dark    1     6
        did     1     4
        gown    1     2
        had     1     3
        house   2     2, 3
        in      5     1, 2, 3, 5, 6
        keep    3     1, 3, 5
        keeper  3     1, 4, 5
        keeps   3     1, 5, 6
        light   1     6
        never   1     4
        night   3     1, 4, 5
        old     4     1, 2, 3, 4
        sleep   1     4
        sleeps  1     6
        the     6     1, 2, 3, 4, 5, 6
        town    2     1, 3
        where   1     4
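The table on this slide can be reproduced with a few lines of plain Java. This is a minimal sketch, not Lucene's implementation: the class name `InvertedIndexSketch` and the naive analysis (lowercase, split on non-letters) are assumptions made for illustration, and real Lucene analyzers do considerably more.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // The six example documents from the slide; document ids start at 1.
    static final List<String> DOCS = List.of(
        "The old night keeper keeps the keep in the town",
        "In the big old house in the big old gown.",
        "The house in the town had the big old keep",
        "Where the old night keeper never did sleep.",
        "The night keeper keeps the keep in the night",
        "And keeps in the dark and sleeps in the light.");

    // Maps each term to the sorted set of ids of the documents that contain it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            // Naive analysis (an assumption): lowercase, then split on non-letters.
            String[] terms = docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Prints term, document frequency and posting list, one term per line.
        build(DOCS).forEach((term, postings) ->
            System.out.println(term + "\t" + postings.size() + "\t" + postings));
    }
}
```

Looking up a term is then a single map access, which is why queries do not need to scan the documents themselves.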
  • 6. Inverted index ● Indexing – Text analysis ● Tokenization, lowercasing and more ● The inverted index can contain more data – Term offsets and more ● The inverted index itself doesn't contain the text for displaying the search results
  • 7. Indexing ● Lucene writes indexes as segments ● Segments are not modifiable: Write-Once ● Each segment is a searchable mini index ● Each segment contains – Inverted index – Stored fields – ...and more
  • 8. Indexing: the commit operation ● Documents are searchable only after a commit! ● Commit also gives durability ● The most expensive operation in Lucene!!!
  • 9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0) ● With the Lucene near-real-time API you don't need a commit to make new documents searchable ● Less expensive than a commit ● Doesn't guarantee durability though ● Exposed as soft commit in Solr 4.0
  • 10. Lucene code example – indexing data
      IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
          new StandardAnalyzer(Version.LUCENE_40));
      Directory directory = FSDirectory.open(new File("data"));
      IndexWriter writer = new IndexWriter(directory, config);

      Document document = new Document();

      FieldType idFieldType = new FieldType();
      idFieldType.setIndexed(true);
      idFieldType.setStored(true);
      idFieldType.setTokenized(false);
      document.add(new Field("id", "id-1", idFieldType));

      FieldType titleFieldType = new FieldType();
      titleFieldType.setIndexed(true);
      titleFieldType.setStored(true);
      document.add(new Field("title", "This is the title", titleFieldType));

      FieldType descriptionFieldType = new FieldType();
      descriptionFieldType.setIndexed(true);
      document.add(new Field("description", "This is the description", descriptionFieldType));

      writer.addDocument(document);
      writer.close();
  • 11. Lucene code example – querying and showing results
      QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
          new StandardAnalyzer(Version.LUCENE_40));
      Query query = queryParser.parse(queryAsString);

      Directory directory = FSDirectory.open(new File("data"));
      IndexReader indexReader = DirectoryReader.open(directory);
      IndexSearcher indexSearcher = new IndexSearcher(indexReader);

      TopDocs topDocs = indexSearcher.search(query, 10);
      System.out.println("Total hits: " + topDocs.totalHits);
      for (ScoreDoc hit : topDocs.scoreDocs) {
          Document document = indexSearcher.doc(hit.doc);
          for (IndexableField field : document) {
              System.out.println(field.name() + ": " + field.stringValue());
          }
      }
  • 12. What's missing? ● A common way to represent documents ● An interface to send documents to (HTTP) ● A way to represent queries ● An interface to send queries to (HTTP) ● Configuration ● Caching ● Distributed infrastructure ● And more...
  • 14. Scaling – why? ‣ The more concurrent searches you run, the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging may be CPU/IO intensive operations ‣ Disk cache invalidation ‣ Fail over
  • 16. Solr replication (pull approach) • Master-slave based solution • Single machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
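The pull approach above can be sketched in a few lines to show why it rules out real-time search. This is a toy model under stated assumptions, not Solr's replication handler: the classes `Master` and `Slave`, the version counter and the `poll` method are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PullReplicationSketch {

    // The single indexing machine. It knows nothing about the slaves.
    static class Master {
        final List<String> docs = new ArrayList<>();
        long version = 0;

        void indexAndCommit(String doc) {
            docs.add(doc);
            version++; // every commit produces a new index version
        }
    }

    // A query machine. It knows the master and pulls from it.
    static class Slave {
        private List<String> docs = new ArrayList<>();
        private long version = 0;

        // Called on a schedule: copy the index only if the master has a newer version.
        void poll(Master master) {
            if (master.version > version) {
                docs = new ArrayList<>(master.docs);
                version = master.version;
            }
        }

        int searchableDocs() {
            return docs.size();
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Slave slave = new Slave();
        master.indexAndCommit("doc-1");
        System.out.println("before poll: " + slave.searchableDocs());
        slave.poll(master);
        System.out.println("after poll: " + slave.searchableDocs());
    }
}
```

A document committed on the master stays invisible to queries until the next poll, so search freshness is bounded by the polling interval.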
  • 17. SolrCloud • A set of new distributed capabilities in Solr • Uses Apache ZooKeeper as a system of record for the cluster state, for central configuration, and for leader election • Whatever server (shard) you send data to, the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up by adding new Solr nodes
  • 18. elasticsearch ● Distributed search engine built on top of Lucene ● Apache 2 license ● Written in Java ● RESTful ● Created and mainly developed by Shay Banon ● A company behind it: elasticsearch.com ● Regular releases – Latest release 0.20.2
  • 19. elasticsearch ● Schemaless – Uses defaults and automatic type guessing – Custom mappings may be defined if needed ● JSON oriented ● Multi tenancy – Multiple indexes per node, multiple types per index ● Designed to be distributed from the beginning ● Almost everything is available as API (including configuration) ● Wide range of administration APIs
  • 20. elasticsearch distributed terminology ● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server) ● Cluster: one or more nodes with the same cluster name ● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards. ● Index: a logical namespace which points to one or more shards – Your code won't deal directly with a shard, only with an index – But an index is composed of multiple Lucene indexes (one per shard)
  • 21. elasticsearch distributed terminology ● More shards: – improve indexing performance – increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well! ● More replicas: – improve failover – improve querying performance
  • 22. Transaction Log • Indexed docs are fully persistent • No need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
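The interplay between the transaction log and the periodic flush can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than elasticsearch's implementation: the class `TranslogSketch` and its methods are hypothetical, and plain lists stand in for the on-disk log and the committed Lucene segments.

```java
import java.util.ArrayList;
import java.util.List;

public class TranslogSketch {

    private final List<String> translog = new ArrayList<>();    // stands in for the durable on-disk log
    private final List<String> uncommitted = new ArrayList<>(); // in-memory index buffer, lost on a crash
    private final List<String> committed = new ArrayList<>();   // stands in for committed Lucene segments

    // Indexing records the operation in the log and makes it searchable, without a commit.
    void index(String doc) {
        translog.add(doc);
        uncommitted.add(doc);
    }

    // The periodic flush: the expensive commit, after which the log can be truncated.
    void flush() {
        committed.addAll(uncommitted);
        uncommitted.clear();
        translog.clear();
    }

    // Simulates kill -9: everything that only lived in memory is gone.
    void crash() {
        uncommitted.clear();
    }

    // On restart, replay the log to restore the operations that were never committed.
    void recover() {
        uncommitted.addAll(translog);
    }

    int searchableDocs() {
        return committed.size() + uncommitted.size();
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("doc-1");
        shard.flush();
        shard.index("doc-2"); // not yet committed
        shard.crash();
        shard.recover();
        System.out.println("searchable after recovery: " + shard.searchableDocs());
    }
}
```

The expensive commit happens only at flush time, while every indexed document is durable as soon as it is in the log.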
  • 23. Index - Shards & Replicas
      [Diagram: a client and two nodes]
      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'
  • 24. Index - Shards & Replicas
      [Diagram: a client and two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'
  • 25. Indexing - 1 • Automatic sharding, push replication
      [Diagram: a client and two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
      curl -XPUT localhost:9200/hippo/users/1 -d '{
        "name" : {
          "first" : "Luca",
          "last" : "Cavanna"
        }
      }'
  • 26. Indexing - 2
      [Diagram: a client and two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
      curl -XPUT localhost:9200/hippo/users/2 -d '{
        "name" : {
          "first" : "Jeroen",
          "last" : "Reijn"
        }
      }'
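Automatic sharding boils down to a deterministic mapping from document id to shard number, so any node can work out where a document belongs. A minimal sketch follows, assuming `String.hashCode()` as a stand-in hash: elasticsearch uses its own hash function, so the shard numbers it picks will differ, and the class and method names here are hypothetical.

```java
public class ShardRoutingSketch {

    // Picks the shard for a document: hash the id, then take it modulo the shard count.
    // String.hashCode() is only a stand-in for the real hash function (an assumption).
    static int shardFor(String documentId, int numberOfShards) {
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // as in the hippo index created earlier
        System.out.println("users/1 -> shard " + shardFor("1", numberOfShards));
        System.out.println("users/2 -> shard " + shardFor("2", numberOfShards));
    }
}
```

With this stand-in hash the two example documents land on different shards. The write goes to the primary copy of the chosen shard and is then pushed to its replica.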
  • 27. Search - 1 • Scatter / Gather search
      [Diagram: a client and two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
      curl -XGET 'localhost:9200/hippo/_search?q=luca'
  • 28. Search - 2 • Automatic balancing between replicas
      [Diagram: a client and two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
      curl -XGET 'localhost:9200/hippo/_search?q=luca'
  • 29. Search - 3 • Automatic failover
      [Diagram: a client and two nodes, as before, with one shard copy marked "failure"]
      curl -XGET 'localhost:9200/hippo/_search?q=luca'
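Scatter / gather search can be sketched as follows: the query goes to one copy of every shard, each shard returns its own top hits, and the receiving node merges them into the final result. This is a simplified model under stated assumptions. The `Hit` class, the precomputed scores and the method names are hypothetical, and a real distributed search also involves a separate fetch phase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ScatterGatherSketch {

    // A scored search hit; in this sketch the scores are simply given up front.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Returns the k best hits, highest score first.
    static List<Hit> topK(List<Hit> hits, int k) {
        return hits.stream()
            .sorted(Comparator.comparingDouble((Hit h) -> h.score).reversed())
            .limit(k)
            .collect(Collectors.toList());
    }

    // Scatter: ask every shard for its local top k. Gather: merge them into a global top k.
    static List<Hit> search(List<List<Hit>> shards, int k) {
        List<Hit> gathered = new ArrayList<>();
        for (List<Hit> shardHits : shards) {
            gathered.addAll(topK(shardHits, k));
        }
        return topK(gathered, k);
    }

    public static void main(String[] args) {
        List<Hit> shard0 = List.of(new Hit("a", 0.9), new Hit("b", 0.1));
        List<Hit> shard1 = List.of(new Hit("c", 0.5), new Hit("d", 0.7));
        for (Hit hit : search(List.of(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```

Each shard only has to return k hits, so the merge step handles at most k times the number of shards candidates, however large the index is.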
  • 30. Adding a node • “Hot” reallocation of shards to the new node
      [Diagram: two nodes. Node 1: shard 0 (primary), shard 1 (replica). Node 2: shard 0 (replica), shard 1 (primary)]
  • 31. Adding a node • “Hot” reallocation of shards to the new node
      [Diagram: three nodes; shard labels: shard 0 (primary), shard 0 (replica), shard 1 (replica), shard 1 (primary)]
  • 32. Adding a node • “Hot” reallocation of shards to the new node
      [Diagram: three nodes; shard labels: shard 0 (primary), shard 0 (replica), shard 0 (replica), shard 1 (replica), shard 1 (primary)]
  • 33. Node failure
      [Diagram: three nodes; shard labels: shard 0 (primary), shard 0 (replica), shard 1 (replica), shard 1 (primary)]
  • 34. Node failure - 1 • Replicas can automatically become primaries
      [Diagram: two nodes; shard labels: shard 0 (primary), shard 1 (primary)]
  • 35. Node failure - 2 • Shards are automatically assigned and do “hot” recovery
      [Diagram: two nodes; shard labels: shard 0 (replica), shard 0 (primary), shard 1 (primary), shard 1 (replica)]
  • 36. Dynamic Replicas
      [Diagram: a client and three nodes; shard labels: shard 0 (primary), shard 0 (replica)]
      curl -XPUT localhost:9200/hippo -d '{
        "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 1
        }
      }'
  • 37. Dynamic Replicas
      [Diagram: a client and three nodes; shard labels: shard 0 (primary), shard 0 (replica), shard 0 (replica)]
      curl -XPUT localhost:9200/hippo/_settings -d '{
        "index" : {
          "number_of_replicas" : 2
        }
      }'
  • 38. Indexing (Push) - ElasticSearch • Documents added through push requests • Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported
      {
        "name": "Luca Cavanna",
        "location": {
          "city": "Amsterdam",
          "country": "The Netherlands"
        }
      }
  • 39. Indexing (Pull) - ElasticSearch • Data flows from sources using ‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr
  • 40. Searching - ElasticSearch • Search request in Request Body • Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more
  • 41. Search Example - ElasticSearch
      Request:
      $ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '{
        "query" : {
          "term" : { "first_name" : "luca" }
        }
      }'
      Response:
      {
        "_shards" : {
          "total" : 5,
          "successful" : 5,
          "failed" : 0
        },
        "hits" : {
          "total" : 1,
          "hits" : [
            {
              "_index" : "hippo",
              "_type" : "users",
              "_id" : "1",
              "_source" : {
                "first_name" : "Luca",
                "last_name" : "Cavanna"
              }
            }
          ]
        }
      }
  • 42. Thanks There would be a lot more to say: • Query DSL • Scripting module (pluggable implementation) • Percolator • Running it embedded Check them out yourself if you are interested! Questions?