15th January 2013 – Hippo meetup

Luca Cavanna
Software developer & Search consultant at Trifork Amsterdam

luca.cavanna@trifork.nl   - @lucacavanna
Trifork (aka Jteam/Dutchworks/Orange11)




     Focus areas:
     –   Big data & Search
     –   Mobile
     –   Custom solutions
     –   Knowledge (GOTO Amsterdam)


 ●   Hippo partner


 ●   Hippo related search projects:
     –   uva.nl
     –   working on rijksoverheid.nl
Agenda



●   Search introduction
    –   Lucene foundation
    –   Why do we need Solr or elasticsearch?
●   Scaling with Solr
●   Elasticsearch distributed nature
●   Elasticsearch features
Apache Lucene



●   High-performance, full-featured text search engine
    library written entirely in Java

●   It indexes documents as collections of fields

●   A field is a string-based key-value pair

●   What data structure does it use under the hood?
Inverted index

Documents:

1   The old night keeper keeps the keep in the town
2   In the big old house in the big old gown.
3   The house in the town had the big old keep
4   Where the old night keeper never did sleep.
5   The night keeper keeps the keep in the night
6   And keeps in the dark and sleeps in the light.

term      freq   posting list
and       1      6
big       2      2, 3
dark      1      6
did       1      4
gown      1      2
had       1      3
house     2      2, 3
in        5      1, 2, 3, 5, 6
keep      3      1, 3, 5
keeper    3      1, 4, 5
keeps     3      1, 5, 6
light     1      6
never     1      4
night     3      1, 4, 5
old       4      1, 2, 3, 4
sleep     1      4
sleeps    1      6
the       6      1, 2, 3, 4, 5, 6
town      2      1, 3
where     1      4
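
The table above can be reproduced with a few lines of plain Java. This is only a minimal sketch of the idea, not how Lucene implements it: the class name InvertedIndexSketch and its naive analysis step (lowercase, split on non-letters) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: maps each term to its posting list.
final class InvertedIndexSketch {

    // The six example documents from the slide.
    static final List<String> DOCUMENTS = Arrays.asList(
            "The old night keeper keeps the keep in the town",
            "In the big old house in the big old gown.",
            "The house in the town had the big old keep",
            "Where the old night keeper never did sleep.",
            "The night keeper keeps the keep in the night",
            "And keeps in the dark and sleeps in the light.");

    // Returns term -> sorted ids of the documents that contain the term.
    static Map<String, TreeSet<Integer>> build(List<String> documents) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            int docId = i + 1; // document ids start at 1, as on the slide
            // Naive text analysis: lowercase, then split on anything that is not a letter.
            String[] tokens = documents.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+");
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Prints one line per term: term, document frequency, posting list.
        build(DOCUMENTS).forEach((term, postings) ->
                System.out.println(term + "  " + postings.size() + "  " + postings));
    }
}
```

Looking up a term is then a single map access, for example build(DOCUMENTS).get("keep"), which is why term queries stay fast as the number of documents grows.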
Inverted index



●   Indexing
    –   Text analysis
         ●   Tokenization, lowercasing and more


●   The inverted index can contain more data
    –   Term offsets and more


●   The inverted index itself doesn't contain the text for
    displaying the search results
Indexing



●   Lucene writes indexes as segments
●   Segments are not modifiable: Write-Once
●   Each segment is a searchable mini index

●   Each segment contains
    –   Inverted index
    –   Stored fields
    –   ...and more
Indexing: the commit operation



●   Documents are searchable only after a commit!

●   Commit also gives durability

●   The most expensive operation in Lucene!!!
Near-real-time search (since Lucene 2.9, exposed in Solr 4.0)



 ●   With the Lucene near-real-time API you don't need a
     commit to make new documents searchable

 ●   Less expensive than commit

 ●   Doesn't guarantee durability though

 ●   Exposed as soft commit in Solr 4.0
Lucene code example – indexing data


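 import java.io.File;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.FieldType;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 // The statements below belong inside a method that declares "throws IOException".
 IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
             new StandardAnalyzer(Version.LUCENE_40));
 Directory directory = FSDirectory.open(new File("data"));
 IndexWriter writer = new IndexWriter(directory, config);

 Document document = new Document();

 // id: indexed as a single untokenized term and stored for retrieval
 FieldType idFieldType = new FieldType();
 idFieldType.setIndexed(true);
 idFieldType.setStored(true);
 idFieldType.setTokenized(false);
 document.add(new Field("id","id-1", idFieldType));

 // title: analyzed, indexed and stored
 FieldType titleFieldType = new FieldType();
 titleFieldType.setIndexed(true);
 titleFieldType.setStored(true);
 document.add(new Field("title","This is the title", titleFieldType));

 // description: analyzed and indexed but not stored (searchable, not retrievable)
 FieldType descriptionFieldType = new FieldType();
 descriptionFieldType.setIndexed(true);
 document.add(new Field("description","This is the description", descriptionFieldType));

 writer.addDocument(document);

 // close() commits the pending changes before releasing the index
 writer.close();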
Lucene code example – querying and showing results



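 import java.io.File;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexableField;
 import org.apache.lucene.queryparser.classic.QueryParser;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 // queryAsString is the user's query text; "title" is the default field to search
 QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
             new StandardAnalyzer(Version.LUCENE_40));
 Query query = queryParser.parse(queryAsString);

 Directory directory = FSDirectory.open(new File("data"));
 IndexReader indexReader = DirectoryReader.open(directory);
 IndexSearcher indexSearcher = new IndexSearcher(indexReader);
 TopDocs topDocs = indexSearcher.search(query, 10);

 System.out.println("Total hits: " + topDocs.totalHits);

 for (ScoreDoc hit : topDocs.scoreDocs) {
     Document document = indexSearcher.doc(hit.doc);
     for (IndexableField field : document) {
         System.out.println(field.name() + ": " + field.stringValue());
     }
 }
 indexReader.close();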
What's missing?


 ●   A common way to represent documents
 ●   An interface to send documents to (HTTP)
 ●   A way to represent queries
 ●   An interface to send queries to (HTTP)
 ●   Configuration
 ●   Caching
 ●   Distributed infrastructure
 ●   And more....
Enterprise search servers
Scaling – why?


 ‣ The more concurrent searches you run, the slower they
   get

 ‣ Indexing and searching on the same machine will
   substantially harm search performance

    ‣ Segment merging can be a CPU- and IO-intensive
      operation

    ‣ Disk cache invalidation

 ‣ Failover
Solr replication example
Solr replication (pull approach)


   • Master-slave based solution
   • Single machine for indexing data (master)
   • Multiple machines for querying (slaves)
   • Master is not aware of the slaves
   • Slave is aware of the master
   • Load balancer responsible for balancing the query
     requests

   • What about real-time search? No way!
SolrCloud


   • A set of new distributed capabilities in Solr
      • Uses Apache ZooKeeper as a system of record for
       the cluster state, for central configuration, and for
       leader election

   • Whatever server (shard) you send data to:
     • the documents get distributed over the shards
     • A shard can be a leader or a replica and contains a
       subset of the data

   • Scale out easily by adding new Solr nodes
elasticsearch




●   Distributed search engine built on top of Lucene
●   Apache 2 license
●   Written in Java
●   RESTful
●   Created and mainly developed by Shay Banon
●   A company behind it: elasticsearch.com
●   Regular releases
    –   Latest release 0.20.2
elasticsearch



●   Schemaless
    –   Uses defaults and automatic type guessing
    –   Custom mappings may be defined if needed
●   JSON oriented
●   Multi tenancy
    –   Multiple indexes per node, multiple types per index
●   Designed to be distributed from the beginning
●   Almost everything is available as API (including
    configuration)
●   Wide range of administration APIs
elasticsearch distributed terminology



●   Node: a running instance of elasticsearch which belongs
    to a cluster (usually one node per server)
●   Cluster: one or more nodes with the same cluster name
●   Shard: a single Lucene instance. A low-level worker unit
    managed by elasticsearch. An index is split into one or
    more shards.
●   Index: a logical namespace which points to one or more
    shards
    –   Your code won't deal directly with a shard, only with
        an index
    –   But an index is composed of multiple Lucene indexes
        (one per shard)
elasticsearch distributed terminology




●   More shards:
    –   improve indexing performance
    –   increase data distribution (depends on # of nodes)
    –   Watch out: each shard has a cost as well!


●   More replicas:
    –   improve failover
    –   improve querying performance
Transaction Log


   • Indexed docs are fully persistent
      • No need for a Lucene IndexWriter#commit
   • Managed using a transaction log / WAL
   • Full single-node durability (survives a kill -9)
   • Utilized when doing hot relocation of shards
   • Periodically “flushed” (calling IW#commit)
   • Durability and real time search together!
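
The interplay between the transaction log and the periodic flush can be illustrated with a small in-memory model. This is a simplified sketch under stated assumptions, not elasticsearch code: the class TranslogSketch and its three lists are hypothetical stand-ins for the committed Lucene segments, the uncommitted in-memory documents, and the on-disk write-ahead log.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, in-memory model of one shard (hypothetical, for illustration only).
final class TranslogSketch {
    private final List<String> committed = new ArrayList<>(); // stands in for committed Lucene segments
    private final List<String> buffer = new ArrayList<>();    // stands in for indexed but uncommitted documents
    private final List<String> translog = new ArrayList<>();  // stands in for the on-disk write-ahead log

    // Every indexed document is appended to the log as well as to the in-memory buffer.
    void index(String document) {
        translog.add(document);
        buffer.add(document);
    }

    // A flush corresponds to a Lucene commit; afterwards the log can be truncated.
    void flush() {
        committed.addAll(buffer);
        buffer.clear();
        translog.clear();
    }

    // A crash (for example kill -9) loses everything that lived only in memory.
    void crash() {
        buffer.clear();
    }

    // On restart, the operations still present in the log are replayed.
    void recover() {
        buffer.addAll(translog);
    }

    // All documents the shard currently knows about.
    List<String> documents() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(buffer);
        return all;
    }

    public static void main(String[] args) {
        TranslogSketch shard = new TranslogSketch();
        shard.index("doc-1");
        shard.flush();          // doc-1 is now committed
        shard.index("doc-2");   // doc-2 exists only in the buffer and the log
        shard.crash();
        shard.recover();
        System.out.println(shard.documents());
    }
}
```

In this model doc-2 survives the crash because it was written to the log before the indexing call returned, even though no commit had happened yet.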
Index - Shards & Replicas



      Node                  Node




      Client:

      curl -XPUT localhost:9200/hippo -d '
      {
         "index" : {
            "number_of_shards" : 2,
            "number_of_replicas" : 1
         }
      }'
Index - Shards & Replicas



         Node                     Node
              Shard 0               Shard 0
             (primary)              (replica)


             Shard 1                 Shard 1
             (replica)              (primary)




         Client:

         curl -XPUT localhost:9200/hippo -d '
         {
            "index" : {
               "number_of_shards" : 2,
               "number_of_replicas" : 1
            }
         }'
Indexing - 1


   • Automatic sharding, push replication
     Node                    Node
         Shard 0               Shard 0
        (primary)              (replica)

        Shard 1                 Shard 1
        (replica)              (primary)



     Client:

     curl -XPUT localhost:9200/hippo/users/1 -d '
     {
        "name" : {
           "first" : "Luca",
           "last" : "Cavanna"
        }
     }'
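
Automatic sharding boils down to mapping each document to one primary shard: a routing key (by default the document id) is hashed, and the result is taken modulo the number of primary shards. The sketch below illustrates that rule in plain Java. The class name ShardRoutingSketch is hypothetical, and String.hashCode() is an assumption used as a stand-in, since elasticsearch uses its own hash function.

```java
// Hypothetical illustration of hash-based document routing (not elasticsearch code).
final class ShardRoutingSketch {

    // Returns the primary shard number for a routing key.
    // String.hashCode() is a stand-in (assumption); floorMod keeps the result non-negative.
    static int shardFor(String routingKey, int numberOfPrimaryShards) {
        return Math.floorMod(routingKey.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // matches "number_of_shards" : 2 of the hippo index
        for (String id : new String[] {"1", "2"}) {
            System.out.println("document " + id + " -> shard " + shardFor(id, numberOfShards));
        }
    }
}
```

Because the shard number depends on the number of primary shards, changing that number would send existing ids to different shards. The number of replicas does not enter the formula, which is consistent with replicas being adjustable on a live index.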
Indexing - 2



      Node                    Node
          Shard 0               Shard 0
         (primary)              (replica)

         Shard 1                 Shard 1
         (replica)              (primary)




      Client:

      curl -XPUT localhost:9200/hippo/users/2 -d '
      {
         "name" : {
            "first" : "Jeroen",
            "last" : "Reijn"
         }
      }'
Search - 1


   • Scatter / Gather search
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                  Shard 1                 Shard 1
                  (replica)              (primary)




                              Client


curl -XGET 'localhost:9200/hippo/_search?q=luca'
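
The gather half of scatter / gather can be sketched as a merge of per-shard result lists: every shard returns its own best hits, and the node that received the request keeps the overall best ones. The code below is a simplified illustration, not elasticsearch code. The names ScatterGatherSketch, Hit and gather are hypothetical, and the scores are made-up example values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of merging per-shard hits into one global result list.
final class ScatterGatherSketch {

    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merges the hits returned by each shard and keeps the `size` highest-scoring ones.
    static List<Hit> gather(List<List<Hit>> perShardHits, int size) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);
        }
        // Highest score first.
        all.sort(Comparator.comparingDouble((Hit hit) -> hit.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(size, all.size())));
    }

    public static void main(String[] args) {
        // Made-up local results of two shards.
        List<Hit> shard0 = Arrays.asList(new Hit("1", 0.9), new Hit("7", 0.2));
        List<Hit> shard1 = Arrays.asList(new Hit("4", 0.5));
        for (Hit hit : gather(Arrays.asList(shard0, shard1), 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```

Each shard only needs to return as many hits as the request asks for, since a document outside a shard's local top results cannot appear in the global top results.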
Search - 2


   • Automatic balancing between replicas
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                  Shard 1                 Shard 1
                  (replica)              (primary)




                              Client


curl -XGET 'localhost:9200/hippo/_search?q=luca'
Search - 3


   • Automatic failover
              Node                     Node
                  Shard 0                Shard 0
                 (primary)               (replica)

                 Shard 1                  Shard 1
                 (replica)   failure     (primary)




                              Client


 curl -XGET 'localhost:9200/hippo/_search?q=luca'
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node
        Shard 0           Shard 0
       (primary)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node                Node
        Shard 0           Shard 0
       (primary)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Adding a node


  • “Hot” reallocation of shards to the new node


    Node              Node                Node
        Shard 0           Shard 0            Shard 0
       (primary)          (replica)          (replica)

       Shard 1            Shard 1
       (replica)         (primary)
Node failure




    Node            Node          Node
        Shard 0                     Shard 0
       (primary)                    (replica)

        Shard 1        Shard 1
        (replica)     (primary)
Node failure - 1


   • Replicas can automatically become primaries


                       Node              Node
                                             Shard 0
                                            (primary)

                           Shard 1
                          (primary)
Node failure - 2


   • Shards are automatically assigned and do “hot”
     recovery


                       Node               Node
                          Shard 0            Shard 0
                          (replica)         (primary)

                           Shard 1           Shard 1
                          (primary)          (replica)
Dynamic Replicas



    Node           Node                  Node
        Shard 0      Shard 0
       (primary)     (replica)




    Client:

    curl -XPUT localhost:9200/hippo -d '
    {
       "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 1
       }
    }'
Dynamic Replicas



    Node           Node                  Node
        Shard 0      Shard 0                 Shard 0
       (primary)     (replica)               (replica)




    Client:

    curl -XPUT localhost:9200/hippo/_settings -d '
    {
       "index" : {
          "number_of_replicas" : 2
       }
    }'
Indexing (Push) - ElasticSearch


 •   Documents added through push requests

 •   Full JSON Object representation of Documents supported

      •   Embedded objects

 •   1st class Parent / Child and Versioning

 •   Near Realtime index refreshing available

 •   Realtime get supported

 Example document with an embedded object:

 {
    "name": "Luca Cavanna",
    "location": {
       "city": "Amsterdam",
       "country": "The Netherlands"
    }
 }
Indexing (Pull) - ElasticSearch


 •   Data flows from sources using ‘Rivers’

 •   Continues to add data as it ‘flows’

 •   Can be added, removed, configured dynamically

 •   Out-of-the-box support for CouchDB and Twitter (implemented by the
     elasticsearch team)

 •   Community implementations for DBs, other NoSQL and Solr

Searching - ElasticSearch

 •   Search request in Request Body

 •   Powerful and extensible Query DSL

 •   Separation of Query and Filters

 •   Named Filters allowing tracking of which Documents matched which
     Filters

 •   The source of each document is stored by default (_source field)

 •   Catch-all field enabled by default (_all field)

 •   Sorting of results

 •   Highlighting, Faceting, Boosting...and more
Search Example - ElasticSearch

$ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '
{
   "query" : {
     "term" : { "first_name" : "luca" }
   }
}'

Response:

{
   "_shards": {
      "total" : 5,
      "successful" : 5,
      "failed" : 0
   },
   "hits": {
      "total" : 1,
      "hits" : [
         {
            "_index" : "hippo",
            "_type" : "users",
            "_id" : "1",
            "_source" : {
               "first_name" : "Luca",
               "last_name" : "Cavanna"
            }
         }
      ]
   }
}
Thanks


  There is a lot more to say:
    • Query DSL

    • Scripting module (pluggable implementation)

    • Percolator

    • Running it embedded

   Check them out yourself if you are interested!

  Questions?

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Hippo meetup: enterprise search with Solr and elasticsearch

  • 1. 15th January 2013 – Hippo meetup Luca Cavanna Software developer & Search consultant at Trifork Amsterdam luca.cavanna@trifork.nl - @lucacavanna
  • 2. Trifork (aka Jteam/Dutchworks/Orange11) Focus areas: – Big data & Search – Mobile – Custom solutions – Knowledge (GOTO Amsterdam) ● Hippo partner ● Hippo related search projects: – uva.nl – working on rijksoverheid.nl
  • 3. Agenda ● Search introduction – Lucene foundation – Why do we need Solr or elasticsearch? ● Scaling with Solr ● Elasticsearch distributed nature ● Elasticsearch features
  • 4. Apache Lucene ● High-performance, full-featured text search engine library written entirely in Java ● It indexes documents as collections of fields ● A field is a string based key-value pair ● What data structure does it use under the hood?
  • 5. Inverted index
      Documents:
        1  The old night keeper keeps the keep in the town
        2  In the big old house in the big old gown.
        3  The house in the town had the big old keep
        4  Where the old night keeper never did sleep.
        5  The night keeper keeps the keep in the night
        6  And keeps in the dark and sleeps in the light.
      term     freq   posting list
      and      1      6
      big      2      2, 3
      dark     1      6
      did      1      4
      gown     1      2
      had      1      3
      house    2      2, 3
      in       5      1, 2, 3, 5, 6
      keep     3      1, 3, 5
      keeper   3      1, 4, 5
      keeps    3      1, 5, 6
      light    1      6
      never    1      4
      night    3      1, 4, 5
      old      4      1, 2, 3, 4
      sleep    1      4
      sleeps   1      6
      the      6      1, 2, 3, 4, 5, 6
      town     2      1, 3
      where    1      4
  • 6. Inverted index ● Indexing – Text analysis ● Tokenization, lowercasing and more ● The inverted index can contain more data – Term offsets and more ● The inverted index itself doesn't contain the text for displaying the search results
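The term table on the inverted index slide can be rebuilt with a few lines of plain Java. This is only a minimal sketch, not Lucene code: the class name `InvertedIndexSketch` and its methods are made up for illustration, and the "analysis" is just lowercasing plus splitting on non-letters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; Lucene's real index format is far richer.
class InvertedIndexSketch {

    // Maps each term to the sorted ids of the documents that contain it.
    static SortedMap<String, SortedSet<Integer>> build(List<String> docs) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // the slide numbers documents from 1
            // Naive text analysis: lowercase, then split on anything that is not a letter.
            String[] tokens = docs.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // The six documents shown on the slide.
    static List<String> sampleDocs() {
        return new ArrayList<>(Arrays.asList(
                "The old night keeper keeps the keep in the town",
                "In the big old house in the big old gown.",
                "The house in the town had the big old keep",
                "Where the old night keeper never did sleep.",
                "The night keeper keeps the keep in the night",
                "And keeps in the dark and sleeps in the light."));
    }

    public static void main(String[] args) {
        // Prints: term, number of documents containing it, posting list.
        build(sampleDocs()).forEach((term, postings) ->
                System.out.println(term + "\t" + postings.size() + "\t" + postings));
    }
}
```

Looking up a term is then a single map lookup followed by a walk over its posting list, which is why this structure suits full-text search.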
  • 7. Indexing ● Lucene writes indexes as segments ● Segments are not modifiable: Write-Once ● Each segment is a searchable mini index ● Each segment contains – Inverted index – Stored fields – ...and more
  • 8. Indexing: the commit operation ● Documents are searchable only after a commit! ● A commit also gives durability ● The most expensive operation in Lucene!!!
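The interplay between write-once segments and commit can be pictured with a toy model. The sketch below is an assumption-laden simplification and not how Lucene is implemented: `ToySegmentedIndex` is a hypothetical class, a "segment" is just an unmodifiable list of strings, and matching is plain substring search.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model of segments and commit; not Lucene's implementation.
class ToySegmentedIndex {
    // Committed segments: each one is write-once and searchable on its own.
    private final List<List<String>> segments = new ArrayList<>();
    // Documents added since the last commit; searches do not see them yet.
    private final List<String> buffer = new ArrayList<>();

    void addDocument(String text) {
        buffer.add(text);
    }

    // Freezes the buffered documents into a new segment, making them searchable.
    void commit() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(Collections.unmodifiableList(new ArrayList<>(buffer)));
        buffer.clear();
    }

    // Searches every committed segment; the uncommitted buffer is ignored.
    List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.addDocument("The old night keeper keeps the keep in the town");
        System.out.println("Before commit: " + index.search("keeper").size() + " hit(s)");
        index.commit();
        System.out.println("After commit: " + index.search("keeper").size() + " hit(s)");
    }
}
```

Each commit in this model adds one more segment, which hints at why real engines merge segments in the background.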
  • 9. Near-real-time search (since Lucene 2.9, exposed in Solr 4.0) ● With the Lucene near-real time API you don't need a commit to make new documents searchable ● Less expensive than commit ● Doesn't guarantee durability though ● Exposed as soft commit in Solr 4.0
  • 10. Lucene code example – indexing data
      // Open (or create) an index in the "data" directory
      IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
              new StandardAnalyzer(Version.LUCENE_40));
      Directory directory = FSDirectory.open(new File("data"));
      IndexWriter writer = new IndexWriter(directory, config);

      Document document = new Document();

      // id: indexed as a single token (not tokenized) and stored
      FieldType idFieldType = new FieldType();
      idFieldType.setIndexed(true);
      idFieldType.setStored(true);
      idFieldType.setTokenized(false);
      document.add(new Field("id", "id-1", idFieldType));

      // title: indexed and stored, so it can be shown in the results
      FieldType titleFieldType = new FieldType();
      titleFieldType.setIndexed(true);
      titleFieldType.setStored(true);
      document.add(new Field("title", "This is the title", titleFieldType));

      // description: indexed only, searchable but not retrievable
      FieldType descriptionFieldType = new FieldType();
      descriptionFieldType.setIndexed(true);
      document.add(new Field("description", "This is the description", descriptionFieldType));

      writer.addDocument(document);
      // Closing the writer also commits the pending changes
      writer.close();
  • 11. Lucene code example – querying and showing results
      // Parse the user's query text (queryAsString) against the "title" field
      QueryParser queryParser = new QueryParser(Version.LUCENE_40, "title",
              new StandardAnalyzer(Version.LUCENE_40));
      Query query = queryParser.parse(queryAsString);

      // Open the index written to the "data" directory
      Directory directory = FSDirectory.open(new File("data"));
      IndexReader indexReader = DirectoryReader.open(directory);
      IndexSearcher indexSearcher = new IndexSearcher(indexReader);

      // Fetch the 10 best hits
      TopDocs topDocs = indexSearcher.search(query, 10);
      System.out.println("Total hits: " + topDocs.totalHits);

      // Print the stored fields of each hit
      for (ScoreDoc hit : topDocs.scoreDocs) {
          Document document = indexSearcher.doc(hit.doc);
          for (IndexableField field : document) {
              System.out.println(field.name() + ": " + field.stringValue());
          }
      }
  • 12. What's missing? ● A common way to represent documents ● Interface to send document to (HTTP) ● A way to represent queries ● Interface to send queries to (HTTP) ● Configuration ● Caching ● Distributed infrastructure ● And more....
  • 14. Scaling – why? ‣ The more concurrent searches you run, the slower they get ‣ Indexing and searching on the same machine will substantially harm search performance ‣ Segment merging may be CPU/IO intensive operations ‣ Disk cache invalidation ‣ Fail over
  • 16. Solr replication (pull approach) • Master-slave based solution • Single machine for indexing data (master) • Multiple machines for querying (slaves) • Master is not aware of the slaves • Slave is aware of the master • Load balancer responsible for balancing the query requests • What about real-time search? No way!
  • 17. SolrCloud • A set of new distributed capabilities in Solr • uses Apache Zookeeper as a system of record for the cluster state, for central configuration, and for leader election • Whatever server (shard) you send data to: • the documents get distributed over the shards • A shard can be a leader or a replica and contains a subset of the data • Easily scale up adding new Solr nodes
  • 18. elasticsearch ● Distributed search engine built on top of Lucene ● Apache 2 license ● Written in Java ● RESTful ● Created and mainly developed by Shay Banon ● A company behind it: elasticsearch.com ● Regular releases – Latest release 0.20.2
  • 19. elasticsearch ● Schemaless – Uses defaults and automatic type guessing – Custom mappings may be defined if needed ● JSON oriented ● Multi tenancy – Multiple indexes per node, multiple types per index ● Designed to be distributed from the beginning ● Almost everything is available as API (including configuration) ● Wide range of administration APIs
  • 20. elasticsearch distributed terminology ● Node: a running instance of elasticsearch which belongs to a cluster (usually one node per server) ● Cluster: one or more nodes with the same cluster name ● Shard: a single Lucene instance. A low-level worker unit managed by elasticsearch. An index is split into one or more shards. ● Index: a logical namespace which points to one or more shards – Your code won't deal directly with a shard, only with an index – But an index is composed of more lucene indexes (one per shard)
  • 21. elasticsearch distributed terminology ● More shards: – improve indexing performance – increase data distribution (depends on # of nodes) – Watch out: each shard has a cost as well! ● More replicas: – increase failover – improve querying performance
  • 22. Transaction Log • Indexed docs are fully persistent • No need for a Lucene IndexWriter#commit • Managed using a transaction log / WAL • Full single node durability (kill dash 9) • Utilized when doing hot relocation of shards • Periodically “flushed” (calling IW#commit) • Durability and real time search together!
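The idea behind the transaction log can be sketched as a toy model: every operation is appended to a log before it is applied, so whatever was not yet flushed can be replayed after a crash. This is not elasticsearch code; `ToyTranslog` and all of its methods are hypothetical names, and the three lists merely stand in for durable segments, the on-disk log and volatile memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of a write-ahead log; not elasticsearch's implementation.
class ToyTranslog {
    private final List<String> flushed = new ArrayList<>();  // stands in for committed Lucene segments
    private final List<String> translog = new ArrayList<>(); // stands in for the durable append-only log
    private final List<String> inMemory = new ArrayList<>(); // lost when the process is killed

    // Log first, then apply: the document is durable before it is acknowledged.
    void index(String doc) {
        translog.add(doc);
        inMemory.add(doc);
    }

    // Periodic flush: persist the in-memory state (the Lucene commit) and truncate the log.
    void flush() {
        flushed.addAll(inMemory);
        inMemory.clear();
        translog.clear();
    }

    // Simulates kill -9 followed by a restart: memory is gone, the log is replayed.
    void crashAndRecover() {
        inMemory.clear();
        inMemory.addAll(translog);
    }

    List<String> visibleDocs() {
        List<String> all = new ArrayList<>(flushed);
        all.addAll(inMemory);
        return all;
    }

    public static void main(String[] args) {
        ToyTranslog shard = new ToyTranslog();
        shard.index("doc-1");
        shard.flush();
        shard.index("doc-2"); // not flushed yet
        shard.crashAndRecover();
        System.out.println(shard.visibleDocs());
    }
}
```

Because the expensive flush happens only now and then, indexing stays cheap while no acknowledged document is lost in this model.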
  • 23. Index - Shards & Replicas [diagram: client and two empty nodes]
      curl -XPUT localhost:9200/hippo -d '
      {
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'
  • 24. Index - Shards & Replicas [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
      curl -XPUT localhost:9200/hippo -d '
      {
        "index" : {
          "number_of_shards" : 2,
          "number_of_replicas" : 1
        }
      }'
  • 25. Indexing - 1 • Automatic sharding, push replication [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
      curl -XPUT localhost:9200/hippo/users/1 -d '
      {
        "name" : {
          "first" : "Luca",
          "last" : "Cavanna"
        }
      }'
  • 26. Indexing - 2 [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
      curl -XPUT localhost:9200/hippo/users/2 -d '
      {
        "name" : {
          "first" : "Jeroen",
          "last" : "Reijn"
        }
      }'
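Automatic sharding boils down to picking a shard from a hash of the routing value (the document id by default) modulo the number of primary shards. The sketch below shows that principle only: `ShardRoutingSketch` is a hypothetical class, and Java's `String.hashCode()` is a stand-in assumption for whatever hash function elasticsearch actually uses, so real shard numbers will differ.

```java
// Hypothetical illustration of hash-based document routing.
class ShardRoutingSketch {

    // Picks the primary shard for a document id.
    // Assumption: String.hashCode() replaces elasticsearch's own hash function.
    static int shardFor(String documentId, int numberOfShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    public static void main(String[] args) {
        int numberOfShards = 2; // as in the "hippo" index created on the slides
        for (String id : new String[] {"1", "2"}) {
            System.out.println("document " + id + " -> shard " + shardFor(id, numberOfShards));
        }
    }
}
```

Since the result depends on the shard count, this scheme only stays stable while the number of primary shards is fixed, whereas the number of replicas can change freely.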
  • 27. Search - 1 • Scatter / Gather search [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary)] curl -XGET 'localhost:9200/hippo/_search?q=luca'
  • 28. Search - 2 • Automatic balancing between replicas [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary)] curl -XGET 'localhost:9200/hippo/_search?q=luca'
  • 29. Search - 3 • Automatic failover [diagram: client and two nodes; shard 0 (primary, replica), shard 1 (replica, primary); one shard copy marked as failed] curl -XGET 'localhost:9200/hippo/_search?q=luca'
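Scatter / gather means the node that receives the request asks one copy of every shard for its best hits and then merges the partial results. Below is a minimal sketch of the gather step only, under the assumption that each shard has already returned scored hits; `ScatterGatherSketch` and `Hit` are hypothetical names, not elasticsearch classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of merging per-shard results.
class ScatterGatherSketch {

    // A scored hit as returned by a single shard.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }

        @Override
        public String toString() {
            return id + ":" + score;
        }
    }

    // Gather: merge the hits of all shards and keep the k best by score.
    static List<Hit> gather(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            all.addAll(shardHits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        // Made-up scores: shard 0 returned two hits, shard 1 returned one.
        List<Hit> shard0 = Arrays.asList(new Hit("1", 0.9), new Hit("3", 0.2));
        List<Hit> shard1 = Arrays.asList(new Hit("2", 0.5));
        System.out.println(gather(Arrays.asList(shard0, shard1), 2));
    }
}
```

Any copy of a shard, primary or replica, can answer its part of the query, which is what makes balancing between replicas and failover possible.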
  • 30. Adding a node • “Hot” reallocation of shards to the new node [diagram: two nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
  • 31. Adding a node • “Hot” reallocation of shards to the new node [diagram: three nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
  • 32. Adding a node • “Hot” reallocation of shards to the new node [diagram: three nodes; shard 0 (primary, replica, replica), shard 1 (replica, primary)]
  • 33. Node failure [diagram: three nodes; shard 0 (primary, replica), shard 1 (replica, primary)]
  • 34. Node failure - 1 • Replicas can automatically become primaries [diagram: two nodes; shard 0 (primary), shard 1 (primary)]
  • 35. Node failure - 2 • Shards are automatically assigned and do “hot” recovery [diagram: two nodes; shard 0 (replica, primary), shard 1 (primary, replica)]
  • 36. Dynamic Replicas [diagram: client and three nodes; shard 0 (primary, replica)]
      curl -XPUT localhost:9200/hippo -d '
      {
        "index" : {
          "number_of_shards" : 1,
          "number_of_replicas" : 1
        }
      }'
  • 37. Dynamic Replicas [diagram: client and three nodes; shard 0 (primary, replica, replica)]
      curl -XPUT localhost:9200/hippo/_settings -d '
      {
        "index" : {
          "number_of_replicas" : 2
        }
      }'
  • 38. Indexing (Push) - ElasticSearch • Documents added through push requests • Full JSON Object representation of Documents supported • Embedded objects • 1st class Parent / Child and Versioning • Near Realtime index refreshing available • Realtime get supported { "name": "Luca Cavanna", "location": { "city": "Amsterdam", "country": "The Netherlands" } }
  • 39. Indexing (Pull) - ElasticSearch • Data flows from sources using ‘Rivers’ • Continues to add data as it ‘flows’ • Can be added, removed, configured dynamically • Out-of-the-box support for CouchDB, Twitter (implemented by the es team) • Community implementations for DBs, other NoSQL and Solr River River
  • 40. Searching - ElasticSearch • Search request in Request Body • Powerful and extensible Query DSL • Separation of Query and Filters • Named Filters allowing tracking of which Documents matched which Filters • By default storing the source of each document (_source field) • Catch all feature enabled by default (_all field) • Sorting of results • Highlighting, Faceting, Boosting...and more
  • 41. Search Example - ElasticSearch
      Request:
      $ curl -XGET 'http://localhost:9200/hippo/users/_search' -d '
      {
        "query" : {
          "term" : { "first_name" : "luca" }
        }
      }'
      Response:
      {
        "_shards" : {
          "total" : 5,
          "successful" : 5,
          "failed" : 0
        },
        "hits" : {
          "total" : 1,
          "hits" : [
            {
              "_index" : "hippo",
              "_type" : "users",
              "_id" : "1",
              "_source" : {
                "first_name" : "Luca",
                "last_name" : "Cavanna"
              }
            }
          ]
        }
      }
  • 42. Thanks There would be a lot more to say: • Query DSL • Scripting module (pluggable implementation) • Percolator • Running it embedded Check them out yourself if you are interested! Questions?