Zend PHP Conference - Santa Clara, US
      Derick Rethans - dr@ez.no
   http://derickrethans.nl/talks.php
About Me

●   Dutchman living in Norway
●   eZ Systems A.S.
●   eZ Components project lead
●   PHP development
●   mcrypt, input_filter, date/time support, unicode
●   QA
Introduction
Indexing




Before you can search, you need to index.


Indexing requires:
   ● Finding the documents to index (crawling)
   ● Separating the documents into indexable units (tokenizing)
   ● Massaging the found units (stemming)
Crawling




●   Domain specific: file system, CMS, Google
●   Should indicate different fields of a document: title,
    description, meta-tags, body
Tokenizing

Turning text into indexable parts.
Text
"This standard was developed from ISO/IEC 9075:1989"

Whitespace:
"This" "standard" "was" "developed" "from" "ISO/IEC" "9075:1989"

Continuous letters:
"This" "standard" "was" "developed" "from" "ISO" "IEC"

HTML
"<li><em>If it exists</em>, the STATUS of the W3C document.</li>"

"If" "it" "exists" "the" "status" "of" "the" "w3c" "document"
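The three strategies above can be sketched in a few lines of PHP. This is only an illustration: the function names are made up for this example, and `strip_tags()` is a crude way to drop markup (it can glue words together when tags are the only separator).

```php
<?php
// Split on runs of whitespace; punctuation stays attached to the tokens.
function tokenizeWhitespace(string $text): array
{
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Keep only runs of letters; digits and punctuation act as separators.
function tokenizeLetters(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

// Strip markup, lowercase (ASCII only), then keep runs of letters and digits.
function tokenizeHtml(string $html): array
{
    $text = strtolower(strip_tags($html));
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);
    return $matches[0];
}

$text = 'This standard was developed from ISO/IEC 9075:1989';
print_r(tokenizeWhitespace($text));
print_r(tokenizeLetters($text));
print_r(tokenizeHtml('<li><em>If it exists</em>, the STATUS of the W3C document.</li>'));
```

Unlike the slide, the HTML variant lowercases every token, including the first one.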
Tokenizing
                Domain specific

Tokenization is domain specific
  ● You don't always want to split letters from numbers,
    e.g. in product numbers.
  ● You might want to exclude words (stop words)
  ● You might want to filter out words that are too short or too
    long
  ● You might want to define synonyms
  ● You might want to normalize text (remove accents,
    Unicode forms)
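A rough sketch of such a token filter, assuming UTF-8 input. The stop-word list, synonym map, accent table and length limits are all hypothetical; a real index needs its own, and a complete solution would use proper Unicode normalization instead of a lookup table.

```php
<?php
// Hypothetical settings for illustration only.
const STOP_WORDS = ['the', 'of'];
const SYNONYMS   = ['colour' => 'color'];
const ACCENT_MAP = ['å' => 'a', 'æ' => 'ae', 'ø' => 'o', 'é' => 'e'];

function normalizeTokens(array $tokens, int $minLen = 2, int $maxLen = 30): array
{
    $result = [];
    foreach ($tokens as $token) {
        // Lowercase ASCII letters, then fold accents via the lookup table.
        $token = strtr(strtolower($token), ACCENT_MAP);
        // Map synonyms onto one canonical form.
        $token = SYNONYMS[$token] ?? $token;
        // Drop tokens that are too short or too long (length in bytes).
        $length = strlen($token);
        if ($length < $minLen || $length > $maxLen) {
            continue;
        }
        // Drop stop words.
        if (in_array($token, STOP_WORDS, true)) {
            continue;
        }
        $result[] = $token;
    }
    return $result;
}

print_r(normalizeTokens(['The', 'colour', 'of', 'blårbærøl', 'a']));
```

With these settings only "color" and "blarbaerol" survive: "The" and "of" are stop words, and "a" is shorter than the minimum length.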
Tokenizing
                 Japanese

There is little punctuation:
辞書 , コーパスに依存しない汎用的な設計
You need special techniques to split it up into bits. Tools like
Kakasi and Mecab.
Output from mecab:
辞書 , コーパスに依存しない汎用的な設計
辞書     名詞 , 普通名詞 ,*,*, 辞書 , じしょ , 代表表記 : 辞書
,       特殊 , 記号 ,*,*,*,*,*
コーパス 名詞 , 普通名詞 ,*,*,*,*,*
に        助詞 , 格助詞 ,*,*, に , に ,*
依存     名詞 , サ変 名詞 ,*,*, 依存 , いぞん , 代表表記 : 依存
し        動詞 ,*, サ変動詞 , 基本連用形 , する , し , 付属動詞候補(基本) 代表表記 : する
ない     接尾辞 , 形容詞性述語接尾辞 , イ形容詞アウオ段 , 基本形 , ない , ない ,*
汎用     名詞 , サ変名詞 ,*,*, 汎用 , はんよう , 代表表記 : 汎用
Stemming

Stemming normalizes words:
   ● Porter stemming
   ● It's language dependent: Snowball
   ● Several algorithms exist
arrival -> arrive
skies -> sky
riding -> ride
rides -> ride
horses -> hors




Alternatively, instead of word analysis you can use "sounds
like" indexing, by using something like soundex or
metaphone:
Word      Soundex  Metaphone
stemming  S355     STMNK
peas      P200     PS
peace     P200     PS
please    P420     PLS
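PHP ships `soundex()` and `metaphone()` as built-in string functions, so a table like the one above can be reproduced directly. The `phoneticIndex()` helper is a hypothetical illustration of how words that sound alike end up under the same key.

```php
<?php
// Print the phonetic keys for a few words.
foreach (['stemming', 'peas', 'peace', 'please'] as $word) {
    printf("%-9s %-8s %s\n", $word, soundex($word), metaphone($word));
}

// Group words by their soundex key: words that sound alike share a bucket.
function phoneticIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

print_r(phoneticIndex(['peas', 'peace', 'please']));
```

A search for "peace" would then look up its key and also find "peas", which shows both the appeal and the imprecision of this approach.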
Stemming
                    Japanese

踊る          odoru    dance

踊らない      odoranai     doesn't dance

踊った         odotta     danced

踊らなかった  odoranakatta didn't dance
踊れる         odoreru  can dance

踊れない      odorenai     can't dance

踊れた         odoreta  could dance

踊れなかった  odorenakatta couldn't dance
踊っている      odotteiru       is dancing

踊っていない  odotteinai         isn't dancing
Searching

Different types of searches
   ● Search words, phrases, boolean: airplane, "red wine",
     wine -red
   ● Field searches: title:tutorial desc:feed
   ● Faceted search:

Demo: http://dev.ezcomponents.org/search
Different Methods
MySQL FULLTEXT

MySQL has support for full-text indexing and searching:
  ● A full-text index in MySQL is an index of type
    FULLTEXT.
  ● Full-text indexes can be used only with MyISAM tables
    (as of MySQL 5.0; InnoDB gained FULLTEXT support in 5.6),
    and can be created only for CHAR, VARCHAR, or
    TEXT columns.
MySQL FULLTEXT
                           Types

Boolean search:
mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+

Natural language search (with or without query expansion):
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
MySQL FULLTEXT

   ● Full-text searches are supported for MyISAM tables
     only.
   ● Full-text searches can be used with multi-byte character
     sets. The exception is that for Unicode, the UTF-8
     character set can be used, but not the ucs2 character
     set.
   ● Ideographic languages such as Chinese and Japanese
     do not have word delimiters. Therefore, the FULLTEXT
     parser cannot determine where words begin and end in
     these and other such languages.
   ● Although the use of multiple character sets within a
     single table is supported, all columns in a FULLTEXT
     index must use the same character set and collation.
From http://dev.mysql.com/doc/refman/5.0/en/fulltext-restrictions.html
Own Implementation

●   We store content differently, so the FULLTEXT MySQL
    approach doesn't work
●   We need support for CJK
●   We need support for other databases
Own Implementation

   ●   Split text into tokens
   ●   Store tokens in a table, uniquely - with frequency
   ●   Store object / word location in a table, next/prev word,
       order
mysql> select * from ezsearch_word order by word limit 250, 4;
+------+--------------+--------------+
| id   | object_count | word         |
+------+--------------+--------------+
| 1761 |            1 | associations |
| 2191 |            1 | assurance    |
|  349 |           37 | at           |
+------+--------------+--------------+

mysql> select word_id, word from ezsearch_object_word_link wl, ezsearch_word w
where wl.word_id = w.id and contentobject_id = 145 order by placement limit 4;
+---------+--------+
| word_id | word   |
+---------+--------+
|    1576 | puts   |
|     926 | europe |
|     349 | at     |
+---------+--------+
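The storage scheme above can be sketched with two in-memory arrays standing in for the tables. The table and column names `word_id`, `object_count`, `contentobject_id` and `placement` come from the queries on the slide; `prev_word_id`, `next_word_id` and the function itself are assumptions made for this illustration, not the actual eZ Publish code.

```php
<?php
// $words plays the role of ezsearch_word, $links of ezsearch_object_word_link.
function indexObject(int $objectId, array $tokens, array &$words, array &$links): void
{
    $tokens = array_values($tokens);

    // Pass 1: store each distinct word once and count the objects it occurs in.
    foreach (array_unique($tokens) as $token) {
        if (!isset($words[$token])) {
            $words[$token] = ['id' => count($words) + 1, 'object_count' => 0];
        }
        $words[$token]['object_count']++;
    }

    // Pass 2: one link row per occurrence, remembering order and neighbours
    // (0 means "no neighbour") so that phrase searches become possible.
    foreach ($tokens as $placement => $token) {
        $prev = $tokens[$placement - 1] ?? null;
        $next = $tokens[$placement + 1] ?? null;
        $links[] = [
            'contentobject_id' => $objectId,
            'word_id'          => $words[$token]['id'],
            'placement'        => $placement,
            'prev_word_id'     => $prev === null ? 0 : $words[$prev]['id'],
            'next_word_id'     => $next === null ? 0 : $words[$next]['id'],
        ];
    }
}

$words = [];
$links = [];
indexObject(145, ['puts', 'europe', 'at', 'europe'], $words, $links);
indexObject(146, ['at', 'home'], $words, $links);
print_r($words['at']);
echo count($links), " link rows\n";
```

Note how the link table grows with every word occurrence while the word table only grows with new vocabulary.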
Own Implementation

Be careful when splitting with regular expressions: character set
and locale issues can bite:
<pre><?php
setlocale( LC_ALL, 'nb_NO.utf8');
var_dump( preg_split( '/\W/u', 'blårbærøl er greit' ) );
?>
Own Implementation

Be careful when splitting with regular expressions: character set
and locale issues can bite:
<pre>
<?php
$string = 'blårbærøl er greit';
$string = iconv( 'utf-8', 'latin1', $string );
setlocale( LC_ALL, 'nb_NO.iso-8859-1');
var_dump( preg_split( '/\W/', $string ) );
?>
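One way to sidestep the locale dependence, assuming the input is always UTF-8, is to split on Unicode character properties instead of `\W`. This is a sketch and not what the slides use; the function name is made up.

```php
<?php
// Assumes this file and the input string are UTF-8 encoded.
function splitWords(string $utf8Text): array
{
    // \p{L} matches any letter, \p{N} any number; /u makes PCRE read UTF-8.
    $parts = preg_split('/[^\p{L}\p{N}]+/u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // preg_split() fails on malformed UTF-8 when /u is used.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $parts;
}

var_dump(splitWords('blårbærøl er greit'));
```

Because the character classes are defined by Unicode properties, the result does not depend on `setlocale()`.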
Own Implementation

Doesn't work very well for huge amounts of content:
mysql> use ezno;
Database changed

mysql> select count(*) from ezsearch_word;
212291

mysql> select count(*) from ezsearch_object_word_link;
15551310
Apache Lucene

Apache Lucene is a high-performance, full-featured text
search engine library written entirely in Java. It is a
technology suitable for nearly any application that requires
full-text search, especially cross-platform.

   ●   Implemented in Java
   ●   Provides indexing and searching libraries
   ●   Ranked searching -- best results returned first
   ●   Many powerful query types: phrase queries, wildcard
       queries, proximity queries, range queries and more
   ●   Fielded searching (e.g., title, author, contents)
   ●   Date-range searching
   ●   Sorting by any field
Zend Lucene

It's a port of Java Lucene to PHP

   ●   Compatible with the Lucene index format
   ●   Provides indexing and searching libraries
   ●   Supports some of the Lucene query language
   ●   Support for indexing HTML documents: title, meta and
       body
   ●   Has support for different field types:
       ●   Keyword: stored and indexed, not tokenized
       ●   UnIndexed: stored, not indexed
       ●   Binary: stored binary data, not indexed
       ●   Text: stored, indexed and tokenized
       ●   UnStored: tokenized and indexed, but not stored
Zend Lucene
                           Indexing Example

Normal:
<?php
// Open existing index
$index = Zend_Search_Lucene::open('/data/my-index');

$doc = new Zend_Search_Lucene_Document();
// Store document URL to identify it in search result.
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
// Index document content
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));

// Add document to the index.
$index->addDocument($doc);
?>

HTML document:
<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);
?>
Zend Lucene
                           Search Example

With parser:
<?php
$index = Zend_Search_Lucene::open('/data/my-index');
// Parse the user's query string and search with the resulting query object
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
$hits = $index->find($query);
?>

With API:
<?php
// Parse the user's query string into a query object
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);

// Build a term query on the 'path' field
$pathTerm = new Zend_Search_Lucene_Index_Term('/data/doc_dir/' . $filename, 'path');
$pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);

// Combine both; pass true as second argument to addSubquery() to make a subquery required
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery);
$query->addSubquery($pathQuery);

$hits = $index->find($query);
?>
Apache Solr

Solr is a standalone enterprise search server with a
web-services-like API. It extends Lucene:
   ● Real schema, with numeric types, dynamic fields,
     unique keys
   ● Powerful extensions to the Lucene query language
   ● Support for dynamic faceted browsing and filtering
   ● Advanced, configurable text analysis
   ● Highly configurable and user extensible caching
   ● Performance optimizations
   ● External configuration via XML
   ● An administration interface
   ● Monitorable logging
   ● Fast incremental updates and snapshot distribution
   ● XML and CSV/delimited-text update formats

Demo: http://localhost:8983/solr/admin/
Marjory

Marjory is a webservice for indexing and searching for
documents, utilizing a full-text search engine.
  ● It is somewhat similar to Solr, but is written in PHP and
    the underlying architecture allows for using search
    engines other than Lucene (no other adaptor is
    implemented yet, though).
  ● Marjory is based on the Zend Framework and uses
    Zend_Search_Lucene as the default search engine.
eZ Components' Search component

●   As our current solution doesn't scale, we needed
    something new
●   As Lucene is really good, the first attempt was to use
    the Java bridge
●   Then Solr came out
●   Which we integrated in "eZ Find"
●   Next versions of eZ Publish will use eZ Components
eZ Components' Search component
               Requirements and Design

●   Support for multiple backends
●   Abstract documents
●   Support for datatypes
●   Rich searching API, including faceted search
●   Easy query interface
eZ Components' Search component
                            Document Definition
<?php
// Enclosing class added so the snippet parses; the slide shows only the method.
class ezcSearchSimpleArticle
{
    static public function getDefinition()
    {
        $n = new ezcSearchDocumentDefinition( 'ezcSearchSimpleArticle' );
        $n->idProperty = 'id';
        $n->fields['id'] =
            new ezcSearchDefinitionDocumentField(
                'id', ezcSearchDocumentDefinition::TEXT );

        // Trailing arguments: boost, inResult, multi, highlight.
        $n->fields['title'] =
            new ezcSearchDefinitionDocumentField(
                'title', ezcSearchDocumentDefinition::TEXT,
                2, true, false, true );

        $n->fields['body'] =
            new ezcSearchDefinitionDocumentField(
                'body', ezcSearchDocumentDefinition::TEXT,
                1, false, false, true );

        $n->fields['published'] =
            new ezcSearchDefinitionDocumentField(
                'published', ezcSearchDocumentDefinition::DATE );

        $n->fields['url'] =
            new ezcSearchDefinitionDocumentField(
                'url', ezcSearchDocumentDefinition::STRING );

        $n->fields['type'] =
            new ezcSearchDefinitionDocumentField(
                'type', ezcSearchDocumentDefinition::STRING,
                0, true, false, false );

        return $n;
    }
}
?>
eZ Components' Search component
                           Document Definition, in XML
<?xml version="1.0"?>
<document>
    <field type="id">id</field>
    <field type="text" boost="2">title</field>
    <field type="text">summary</field>
    <field inResult="false" type="html">body</field>
    <field type="date">published</field>
    <field type="string" multi="true">author</field>
</document>




Setting up the manager:
<?php
$backend = new ezcSearchSolrHandler;

$session = new ezcSearchSession(
    $backend,
    new ezcSearchXmlManager( $testFilesDir )
);
?>
eZ Components' Search component
                           Indexing
<?php
$session = new ezcSearchSession(
    $backend,
    new ezcSearchXmlManager( $testFilesDir )
);

$a = new Article(
    null, // id
    'Test Article', // title
    'This is an article to test', // description
    'the body of the article', // body
    time() // published
);
$session->index( $a );
?>
eZ Components' Search component
                             Search - API
<?php
$session = new ezcSearchSession(
    $backend,
    new ezcSearchXmlManager( $testFilesDir )
);

$q = $session->createFindQuery( 'Article' );
// Chain the calls: a ';' after where() would end the statement too early
$q->where( $q->eq( 'title', 'Article' ) )
  ->limit( 5 )
  ->orderBy( 'id' );

$r = $session->find( $q );

foreach ( $r->documents as $document )
{
    echo $document['document']->title, "\n";
}
?>
eZ Components' Search component
                           Search - Query Builder
<?php
$session = new ezcSearchSession(
    $backend,
    new ezcSearchXmlManager( $testFilesDir )
);

$q = $session->createFindQuery( 'Article' );

new ezcSearchQueryBuilder(
    $q,
    'thunderball',
    array( 'fieldOne', 'fieldTwo' )
);

$q->facet( 'title' ); // keyword data field

$r = $session->find( $q );
foreach ( $r->documents as $document )
{
    echo $document['document']->title, "\n";
}
?>
Notes on Performance
               Solr vs Zend Lucene

●   Neither engine is tuned
●   Solr indexes about 25% faster for small documents (one
    sentence)
●   Solr indexes about 200% faster for big documents
    (64 KB)
Future improvements

●   More backends: Google, Marjory, Sphinx, Xapian, Yahoo
●   More features: SpellChecker, MoreLikeThis
Resources




Porter algorithm:
http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html
Solr: http://lucene.apache.org/solr/
Zend Lucene:
http://framework.zend.com/manual/en/zend.search.lucene.html
Snowball: http://snowball.tartarus.org/
MeCab (in Japanese): http://mecab.sourceforge.net/
These slides: http://derickrethans.nl/talks.php

More Related Content

What's hot

0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269
0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa0602690047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269
0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269tutorialsruby
 
Doctrine 2
Doctrine 2Doctrine 2
Doctrine 2zfconfua
 
Let's write secure Drupal code! DUG Belgium - 08/08/2019
Let's write secure Drupal code! DUG Belgium - 08/08/2019Let's write secure Drupal code! DUG Belgium - 08/08/2019
Let's write secure Drupal code! DUG Belgium - 08/08/2019Balázs Tatár
 
Facebook的缓存系统
Facebook的缓存系统Facebook的缓存系统
Facebook的缓存系统yiditushe
 
Let's write secure drupal code! - Drupal Camp Pannonia 2019
Let's write secure drupal code! - Drupal Camp Pannonia 2019Let's write secure drupal code! - Drupal Camp Pannonia 2019
Let's write secure drupal code! - Drupal Camp Pannonia 2019Balázs Tatár
 
Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1medcl
 
Database Connection With Mysql
Database Connection With MysqlDatabase Connection With Mysql
Database Connection With MysqlHarit Kothari
 
Let's write secure Drupal code! - Drupal Camp Poland 2019
Let's write secure Drupal code! - Drupal Camp Poland 2019Let's write secure Drupal code! - Drupal Camp Poland 2019
Let's write secure Drupal code! - Drupal Camp Poland 2019Balázs Tatár
 
Hadoop installation
Hadoop installationHadoop installation
Hadoop installationhabeebulla g
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2rusersla
 
Rapid prototyping search applications with solr
Rapid prototyping search applications with solrRapid prototyping search applications with solr
Rapid prototyping search applications with solrLucidworks (Archived)
 
Quick reference for mongo shell commands
Quick reference for mongo shell commandsQuick reference for mongo shell commands
Quick reference for mongo shell commandsRajkumar Asohan, PMP
 
Boosting MongoDB performance
Boosting MongoDB performanceBoosting MongoDB performance
Boosting MongoDB performanceAlexei Panin
 

What's hot (18)

veracruz
veracruzveracruz
veracruz
 
0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269
0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa0602690047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269
0047ecaa6ea3e9ac0a13a2fe96f4de3bfd515c88f5d90c1fae79b956363d7f02c7fa060269
 
Doctrine 2
Doctrine 2Doctrine 2
Doctrine 2
 
Let's write secure Drupal code! DUG Belgium - 08/08/2019
Let's write secure Drupal code! DUG Belgium - 08/08/2019Let's write secure Drupal code! DUG Belgium - 08/08/2019
Let's write secure Drupal code! DUG Belgium - 08/08/2019
 
Facebook的缓存系统
Facebook的缓存系统Facebook的缓存系统
Facebook的缓存系统
 
Let's write secure drupal code! - Drupal Camp Pannonia 2019
Let's write secure drupal code! - Drupal Camp Pannonia 2019Let's write secure drupal code! - Drupal Camp Pannonia 2019
Let's write secure drupal code! - Drupal Camp Pannonia 2019
 
Mysql
MysqlMysql
Mysql
 
New tags in html5
New tags in html5New tags in html5
New tags in html5
 
Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1Elastic Search Training#1 (brief tutorial)-ESCC#1
Elastic Search Training#1 (brief tutorial)-ESCC#1
 
MYSQL - PHP Database Connectivity
MYSQL - PHP Database ConnectivityMYSQL - PHP Database Connectivity
MYSQL - PHP Database Connectivity
 
Database Connection With Mysql
Database Connection With MysqlDatabase Connection With Mysql
Database Connection With Mysql
 
lab56_db
lab56_dblab56_db
lab56_db
 
Let's write secure Drupal code! - Drupal Camp Poland 2019
Let's write secure Drupal code! - Drupal Camp Poland 2019Let's write secure Drupal code! - Drupal Camp Poland 2019
Let's write secure Drupal code! - Drupal Camp Poland 2019
 
Hadoop installation
Hadoop installationHadoop installation
Hadoop installation
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2
 
Rapid prototyping search applications with solr
Rapid prototyping search applications with solrRapid prototyping search applications with solr
Rapid prototyping search applications with solr
 
Quick reference for mongo shell commands
Quick reference for mongo shell commandsQuick reference for mongo shell commands
Quick reference for mongo shell commands
 
Boosting MongoDB performance
Boosting MongoDB performanceBoosting MongoDB performance
Boosting MongoDB performance
 

Similar to Of Haystacks And Needles

Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Prajal Kulkarni
 
Internationalizing CakePHP Applications
Internationalizing CakePHP ApplicationsInternationalizing CakePHP Applications
Internationalizing CakePHP ApplicationsPierre MARTIN
 
Service discovery and configuration provisioning
Service discovery and configuration provisioningService discovery and configuration provisioning
Service discovery and configuration provisioningSource Ministry
 
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup Alberto Paro
 
MySQL 8.0.17 - New Features Summary
MySQL 8.0.17 - New Features SummaryMySQL 8.0.17 - New Features Summary
MySQL 8.0.17 - New Features SummaryOlivier DASINI
 
2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQLHung-yu Lin
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersJonathan Levin
 
Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Michelangelo van Dam
 
Performance Optimization and JavaScript Best Practices
Performance Optimization and JavaScript Best PracticesPerformance Optimization and JavaScript Best Practices
Performance Optimization and JavaScript Best PracticesDoris Chen
 
Object Relational Mapping in PHP
Object Relational Mapping in PHPObject Relational Mapping in PHP
Object Relational Mapping in PHPRob Knight
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxPythian
 
Presto Meetup 2016 Small Start
Presto Meetup 2016 Small StartPresto Meetup 2016 Small Start
Presto Meetup 2016 Small StartHiroshi Toyama
 
Resource Routing in ExpressionEngine
Resource Routing in ExpressionEngineResource Routing in ExpressionEngine
Resource Routing in ExpressionEngineMichaelRog
 
SYN507: Reducing desktop infrastructure management overhead using “old school...
SYN507: Reducing desktop infrastructure management overhead using “old school...SYN507: Reducing desktop infrastructure management overhead using “old school...
SYN507: Reducing desktop infrastructure management overhead using “old school...Denis Gundarev
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 

Similar to Of Haystacks And Needles (20)

Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.
 
Internationalizing CakePHP Applications
Internationalizing CakePHP ApplicationsInternationalizing CakePHP Applications
Internationalizing CakePHP Applications
 
Service discovery and configuration provisioning
Service discovery and configuration provisioningService discovery and configuration provisioning
Service discovery and configuration provisioning
 
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup ElasticSearch 5.x -  New Tricks - 2017-02-08 - Elasticsearch Meetup
ElasticSearch 5.x - New Tricks - 2017-02-08 - Elasticsearch Meetup
 
MySQL 8.0.17 - New Features Summary
MySQL 8.0.17 - New Features SummaryMySQL 8.0.17 - New Features Summary
MySQL 8.0.17 - New Features Summary
 
Percona toolkit
Percona toolkitPercona toolkit
Percona toolkit
 
Download It
Download ItDownload It
Download It
 
Logstash
LogstashLogstash
Logstash
 
2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL2014 database - course 3 - PHP and MySQL
2014 database - course 3 - PHP and MySQL
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Scaling MySQL Strategies for Developers
Scaling MySQL Strategies for DevelopersScaling MySQL Strategies for Developers
Scaling MySQL Strategies for Developers
 
Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010
 
Performance Optimization and JavaScript Best Practices
Performance Optimization and JavaScript Best PracticesPerformance Optimization and JavaScript Best Practices
Performance Optimization and JavaScript Best Practices
 
Object Relational Mapping in PHP
Object Relational Mapping in PHPObject Relational Mapping in PHP
Object Relational Mapping in PHP
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
 
Presto Meetup 2016 Small Start
Presto Meetup 2016 Small StartPresto Meetup 2016 Small Start
Presto Meetup 2016 Small Start
 
Resource Routing in ExpressionEngine
Resource Routing in ExpressionEngineResource Routing in ExpressionEngine
Resource Routing in ExpressionEngine
 
Os Wilhelm
Os WilhelmOs Wilhelm
Os Wilhelm
 
SYN507: Reducing desktop infrastructure management overhead using “old school...
SYN507: Reducing desktop infrastructure management overhead using “old school...SYN507: Reducing desktop infrastructure management overhead using “old school...
SYN507: Reducing desktop infrastructure management overhead using “old school...
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 

More from ZendCon

Framework Shootout
Framework ShootoutFramework Shootout
Framework ShootoutZendCon
 
Zend_Tool: Practical use and Extending
Zend_Tool: Practical use and ExtendingZend_Tool: Practical use and Extending
Zend_Tool: Practical use and ExtendingZendCon
 
PHP on IBM i Tutorial
PHP on IBM i TutorialPHP on IBM i Tutorial
PHP on IBM i TutorialZendCon
 
PHP on Windows - What's New
PHP on Windows - What's NewPHP on Windows - What's New
PHP on Windows - What's NewZendCon
 
PHP and Platform Independance in the Cloud
PHP and Platform Independance in the CloudPHP and Platform Independance in the Cloud
PHP and Platform Independance in the CloudZendCon
 
I18n with PHP 5.3
I18n with PHP 5.3I18n with PHP 5.3
I18n with PHP 5.3ZendCon
 
Cloud Computing: The Hard Problems Never Go Away
Cloud Computing: The Hard Problems Never Go AwayCloud Computing: The Hard Problems Never Go Away
Cloud Computing: The Hard Problems Never Go AwayZendCon
 
Planning for Synchronization with Browser-Local Databases
Planning for Synchronization with Browser-Local DatabasesPlanning for Synchronization with Browser-Local Databases
Planning for Synchronization with Browser-Local DatabasesZendCon
 
Magento - a Zend Framework Application
Magento - a Zend Framework ApplicationMagento - a Zend Framework Application
Magento - a Zend Framework ApplicationZendCon
 
Enterprise-Class PHP Security
Enterprise-Class PHP SecurityEnterprise-Class PHP Security
Enterprise-Class PHP SecurityZendCon
 
PHP and IBM i - Database Alternatives
PHP and IBM i - Database AlternativesPHP and IBM i - Database Alternatives
PHP and IBM i - Database AlternativesZendCon
 
Zend Core on IBM i - Security Considerations
Zend Core on IBM i - Security ConsiderationsZend Core on IBM i - Security Considerations
Zend Core on IBM i - Security ConsiderationsZendCon
 
Application Diagnosis with Zend Server Tracing
Application Diagnosis with Zend Server TracingApplication Diagnosis with Zend Server Tracing
Application Diagnosis with Zend Server TracingZendCon
 
Insights from the Experts: How PHP Leaders Are Transforming High-Impact PHP A...
Insights from the Experts: How PHP Leaders Are Transforming High-Impact PHP A...Insights from the Experts: How PHP Leaders Are Transforming High-Impact PHP A...
Insights from the Experts: How PHP Leaders Are Transforming High-Impact PHP A...ZendCon
 
Solving the C20K problem: Raising the bar in PHP Performance and Scalability
Solving the C20K problem: Raising the bar in PHP Performance and ScalabilitySolving the C20K problem: Raising the bar in PHP Performance and Scalability
Solving the C20K problem: Raising the bar in PHP Performance and ScalabilityZendCon
 
Joe Staner Zend Con 2008
Joe Staner Zend Con 2008Joe Staner Zend Con 2008
Joe Staner Zend Con 2008ZendCon
 
Tiery Eyed
Tiery EyedTiery Eyed
Tiery EyedZendCon
 
Make your PHP Application Software-as-a-Service (SaaS) Ready with the Paralle...
Make your PHP Application Software-as-a-Service (SaaS) Ready with the Paralle...Make your PHP Application Software-as-a-Service (SaaS) Ready with the Paralle...
Make your PHP Application Software-as-a-Service (SaaS) Ready with the Paralle...ZendCon
 
DB2 Storage Engine for MySQL and Open Source Applications Session
DB2 Storage Engine for MySQL and Open Source Applications SessionDB2 Storage Engine for MySQL and Open Source Applications Session
DB2 Storage Engine for MySQL and Open Source Applications SessionZendCon
 
Digital Identity
Digital IdentityDigital Identity
Digital IdentityZendCon
 

Of Haystacks And Needles

  • 1. Zend PHP Conference - Santa Clara, US Derick Rethans - dr@ez.no http://derickrethans.nl/talks.php
  • 2. About Me ● Dutchman living in Norway ● eZ Systems A.S. ● eZ Components project lead ● PHP development ● mcrypt, input_filter, date/time support, unicode ● QA
  • 4. Indexing Before you can search, you need to index. Indexing requires: ● Finding the documents to index (crawling) ● Separating the documents into indexable units (tokenizing) ● Massaging the resulting units (stemming)
  • 5. Crawling ● Domain specific: file system, CMS, Google ● Should indicate different fields of a document: title, description, meta-tags, body
  • 6. Tokenizing
       Turning text into indexable parts.
       Text:
         "This standard was developed from ISO/IEC 9075:1989"
       Whitespace:
         "This" "standard" "was" "developed" "from" "ISO/IEC" "9075:1989"
       Continuous letters:
         "This" "standard" "was" "developed" "from" "ISO" "IEC"
       HTML:
         "<li><em>If it exists</em>, the STATUS of the W3C document.</li>"
         "If" "it" "exists" "the" "status" "of" "the" "w3c" "document"
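The three strategies on this slide can be sketched with PHP's built-in PCRE functions. This is a minimal illustration, not the implementation used in the talk: the function names are hypothetical, and the HTML variant lowercases every token (including the first one).

```php
<?php
// Hypothetical helpers; only preg_split(), preg_match_all(),
// strip_tags() and strtolower() are PHP built-ins.

// Split on runs of whitespace, keeping punctuation inside tokens.
function tokenizeWhitespace(string $text): array
{
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Keep only runs of letters; digits and punctuation act as separators.
function tokenizeLetters(string $text): array
{
    preg_match_all('/\p{L}+/u', $text, $matches);
    return $matches[0];
}

// Strip tags, lowercase (ASCII only), then keep runs of letters and digits.
function tokenizeHtml(string $html): array
{
    $text = strtolower(strip_tags($html));
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);
    return $matches[0];
}

$text = 'This standard was developed from ISO/IEC 9075:1989';
$html = '<li><em>If it exists</em>, the STATUS of the W3C document.</li>';

print_r(tokenizeWhitespace($text));
print_r(tokenizeLetters($text));
print_r(tokenizeHtml($html));
```

The whitespace tokenizer keeps "ISO/IEC" and "9075:1989" intact, while the letters-only tokenizer drops the number entirely, which is exactly the trade-off the slide shows.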
  • 7. Tokenizing: domain specific ● You don't always want to split letters from numbers, e.g. in product numbers. ● You might want to exclude words (stop words) ● You might want to filter out words that are too short or too long ● You might want to define synonyms ● You might want to normalize text (remove accents, Unicode forms)
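A minimal sketch of such a domain-specific filter stage, applied after tokenizing. The function name, the stop-word list, the synonym map and the length limits are all assumptions made for illustration; accent and Unicode normalization are left out.

```php
<?php
// Hypothetical post-tokenizing filter: lowercases, drops stop words and
// tokens outside the length limits, and maps synonyms to one canonical word.
function filterTokens(
    array $tokens,
    array $stopWords,
    array $synonyms,
    int $minLen = 2,
    int $maxLen = 20
): array {
    $result = [];
    foreach ($tokens as $token) {
        $token = strtolower($token);
        $length = strlen($token); // byte length, which is fine for ASCII tokens
        if ($length < $minLen || $length > $maxLen) {
            continue; // too short or too long
        }
        if (in_array($token, $stopWords, true)) {
            continue; // stop word
        }
        $result[] = $synonyms[$token] ?? $token;
    }
    return $result;
}

// 'AB-1234' stands in for a product number that the tokenizer left intact.
$tokens = ['The', 'status', 'of', 'a', 'W3C', 'doc', 'AB-1234'];
$filtered = filterTokens($tokens, ['the', 'of'], ['doc' => 'document']);
print_r($filtered);
```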
  • 8. Tokenizing: Japanese
       There is little punctuation:
       辞書 , コーパスに依存しない汎用的な設計
       You need special techniques to split it up into bits. Tools like Kakasi and Mecab.
       Output from mecab:
       辞書 , コーパスに依存しない汎用的な設計
       辞書     名詞 , 普通名詞 ,*,*, 辞書 , じしょ , 代表表記 : 辞書
       ,        特殊 , 記号 ,*,*,*,*,*
       コーパス 名詞 , 普通名詞 ,*,*,*,*,*
       に       助詞 , 格助詞 ,*,*, に , に ,*
       依存     名詞 , サ変名詞 ,*,*, 依存 , いぞん , 代表表記 : 依存
       し       動詞 ,*, サ変動詞 , 基本連用形 , する , し , 付属動詞候補(基本) 代表表記 : する
       ない     接尾辞 , 形容詞性述語接尾辞 , イ形容詞アウオ段 , 基本形 , ない , ない ,*
       汎用     名詞 , サ変名詞 ,*,*, 汎用 , はんよう , 代表表記 : 汎用
  • 9. Stemming
       Stemming normalizes words: ● Porter stemming ● It's language dependent: snowball ● Several algorithms exist
       arrival -> arrive
       skies   -> sky
       riding  -> ride
       rides   -> ride
       horses  -> hors
       Alternatively, instead of word analysis you can use "sounds like" indexing, by using something like soundex or metaphone:
       Word      Soundex  Metaphone
       stemming  S355     STMNK
       peas      P200     PS
       peace     P200     PS
       please    P420     PLS
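PHP ships soundex() and metaphone() as built-in string functions, so a "sounds like" index can be sketched in a few lines. The phoneticIndex() helper and its array layout are assumptions for illustration only.

```php
<?php
// Group words by their soundex key, so that words which sound alike
// end up in the same bucket. phoneticIndex() is a hypothetical helper;
// soundex() and metaphone() are PHP built-ins.
function phoneticIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[soundex($word)][] = $word;
    }
    return $index;
}

$words = ['stemming', 'peas', 'peace', 'please'];

// Print the same table as on the slide.
foreach ($words as $word) {
    printf("%-10s %-8s %s\n", $word, soundex($word), metaphone($word));
}

$index = phoneticIndex($words);
print_r($index['P200']); // the bucket shared by "peas" and "peace"
```

A search for "peace" would then look up soundex('peace') and also find documents containing "peas", which is sometimes what you want and sometimes not.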
  • 10. Stemming: Japanese
       踊る          odoru         dance
       踊らない      odoranai      doesn't dance
       踊った        odotta        danced
       踊らなかった  odoranakatta  didn't dance
       踊れる        odoreru       can dance
       踊れない      odorenai      can't dance
       踊れた        odoreta       could dance
       踊れなかった  odorenakatta  couldn't dance
       踊っている    odotteiru     is dancing
       踊っていない  odotteinai    isn't dancing
  • 11. Searching: different types of searches ● Search words, phrases, boolean: airplane, "red wine", wine -red ● Field searches: title:tutorial desc:feed ● Faceted search: demo at http://dev.ezcomponents.org/search
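To make the query forms above concrete, here is a rough sketch of a parser that recognises plain words, quoted phrases, excluded terms (leading "-") and field prefixes (field:term). The parseQuery() function and its return structure are hypothetical; real engines such as Lucene support a much richer syntax.

```php
<?php
// Hypothetical parser for the query forms listed on the slide.
// Returns the terms to include and the terms to exclude.
function parseQuery(string $query): array
{
    $parsed = ['include' => [], 'exclude' => []];

    // Groups: 1 = optional minus, 2 = optional field, 3 = phrase, 4 = word
    preg_match_all(
        '/(-?)(?:(\w+):)?(?:"([^"]+)"|(\S+))/',
        $query,
        $matches,
        PREG_SET_ORDER
    );

    foreach ($matches as $m) {
        $isPhrase = $m[3] !== '';
        $term = [
            'field'  => $m[2] !== '' ? $m[2] : null,
            'text'   => $isPhrase ? $m[3] : ($m[4] ?? ''),
            'phrase' => $isPhrase,
        ];
        $parsed[$m[1] === '-' ? 'exclude' : 'include'][] = $term;
    }
    return $parsed;
}

$result = parseQuery('"red wine" wine -red title:tutorial');
print_r($result);
```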
  • 13. MySQL FULLTEXT MySQL has support for full-text indexing and searching: ● A full-text index in MySQL is an index of type FULLTEXT. ● Full-text indexes can be used only with MyISAM tables, and can be created only for CHAR, VARCHAR, or TEXT columns.
  • 14. MySQL FULLTEXT: types
       Boolean search:
       mysql> SELECT * FROM articles WHERE MATCH (title,body)
           -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
       +----+-----------------------+-------------------------------------+
       | id | title                 | body                                |
       +----+-----------------------+-------------------------------------+
       |  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
       |  2 | How To Use MySQL Well | After you went through a ...        |
       |  3 | Optimizing MySQL      | In this tutorial we will show ...   |
       |  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
       |  6 | MySQL Security        | When configured properly, MySQL ... |
       +----+-----------------------+-------------------------------------+
       Natural search (with or without query expansion):
       mysql> SELECT * FROM articles
           -> WHERE MATCH (title,body) AGAINST ('database');
       +----+-------------------+------------------------------------------+
       | id | title             | body                                     |
       +----+-------------------+------------------------------------------+
       |  5 | MySQL vs. YourSQL | In the following database comparison ... |
       |  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
       +----+-------------------+------------------------------------------+
  • 15. MySQL FULLTEXT ● Full-text searches are supported for MyISAM tables only. ● Full-text searches can be used with multi-byte character sets. The exception is that for Unicode, the UTF-8 character set can be used, but not the ucs2 character set. ● Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the FULLTEXT parser cannot determine where words begin and end in these and other such languages. ● Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation. From http://dev.mysql.com/doc/refman/5.0/en/fulltext-restrictions.html
  • 16. Own Implementation ● We store content differently, so the MySQL FULLTEXT approach doesn't work ● We needed support for CJK ● We needed support for other databases
  • 17. Own Implementation ● Split text into tokens ● Store tokens in a table, uniquely, with frequency ● Store object / word location in a table, next/prev word, order
       mysql> select * from ezsearch_word order by word limit 250, 4;
       +------+--------------+--------------+
       | id   | object_count | word         |
       +------+--------------+--------------+
       | 1761 |            1 | associations |
       | 2191 |            1 | assurance    |
       |  349 |           37 | at           |
       +------+--------------+--------------+
       mysql> select word_id, word from ezsearch_object_word_link wl, ezsearch_word w
              where wl.word_id = w.id and contentobject_id = 145 order by placement limit 4;
       +---------+--------+
       | word_id | word   |
       +---------+--------+
       |    1576 | puts   |
       |     926 | europe |
       |     349 | at     |
       +---------+--------+
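The two-table layout can be sketched in memory with plain PHP arrays. This is only an illustration of the idea: indexObject() is a hypothetical function, the array keys are loosely modelled on the ezsearch_word and ezsearch_object_word_link columns shown above, and the prev/next column names are assumptions.

```php
<?php
// $words plays the role of ezsearch_word: one entry per distinct word,
// with the number of objects that contain it.
// $links plays the role of ezsearch_object_word_link: one entry per word
// occurrence, with its placement and neighbouring words.
function indexObject(array &$words, array &$links, int $objectId, array $tokens): void
{
    $tokens = array_values($tokens);

    // Pass 1: register each distinct word once and bump its object count.
    foreach (array_unique($tokens) as $token) {
        if (!isset($words[$token])) {
            $words[$token] = ['id' => count($words) + 1, 'object_count' => 0];
        }
        $words[$token]['object_count']++;
    }

    // Pass 2: store one link row per occurrence (0 means "no neighbour").
    foreach ($tokens as $placement => $token) {
        $prev = $tokens[$placement - 1] ?? null;
        $next = $tokens[$placement + 1] ?? null;
        $links[] = [
            'contentobject_id' => $objectId,
            'word_id'          => $words[$token]['id'],
            'placement'        => $placement,
            'prev_word_id'     => $prev === null ? 0 : $words[$prev]['id'],
            'next_word_id'     => $next === null ? 0 : $words[$next]['id'],
        ];
    }
}

$words = [];
$links = [];
indexObject($words, $links, 145, ['puts', 'europe', 'at']);
indexObject($words, $links, 146, ['at', 'europe', 'at']);
print_r($words);
```

Storing the neighbouring word ids is what makes phrase searches possible without re-reading the original text, at the cost of one link row per word occurrence.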
  • 18. Own Implementation
       Be careful when splitting with regular expressions: character set and locale issues come into play.
       <pre>
       <?php
       setlocale( LC_ALL, 'nb_NO.utf8' );
       var_dump( preg_split( '/\W/u', 'blårbærøl er greit' ) );
       ?>
  • 19. Own Implementation
       Be careful when splitting with regular expressions: character set and locale issues come into play.
       <pre>
       <?php
       $string = 'blårbærøl er greit';
       $string = iconv( 'utf-8', 'latin1', $string );
       setlocale( LC_ALL, 'nb_NO.iso-8859-1' );
       var_dump( preg_split( '/\W/', $string ) );
       ?>
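One way to sidestep the locale dependence is to name the character classes explicitly. The sketch below assumes the source file and the string are UTF-8 encoded; it contrasts an ASCII-only split with a split on Unicode letter and number properties, and is an illustration rather than the approach taken in the talk.

```php
<?php
$string = 'blårbærøl er greit'; // assumed to be UTF-8 encoded

// ASCII-only character class: the bytes of å, æ and ø are treated as
// separators, so the first word is cut into pieces.
$asciiTokens = preg_split('/[^a-zA-Z0-9]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

// UTF-8 mode (/u) with Unicode properties: any letter or number in any
// script counts as part of a word, independent of the current locale.
$unicodeTokens = preg_split('/[^\p{L}\p{N}]+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($asciiTokens);
var_dump($unicodeTokens);
```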
  • 20. Own Implementation
       Doesn't work very well for huge amounts of content:
       mysql> use ezno;
       Database changed
       mysql> select count(*) from ezsearch_word;
       212291
       mysql> select count(*) from ezsearch_object_word_link;
       15551310
  • 21. Apache Lucene Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. ● Implemented in Java ● Provides indexing and searching libraries ● Ranked searching -- best results returned first ● Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more ● Fielded searching (e.g., title, author, contents) ● Date-range searching ● Sorting by any field
  • 22. Zend Lucene It's a port of Java Lucene to PHP ● Compatible with the Lucene index format ● Provides indexing and searching libraries ● Supports some of the Lucene query language ● Support for indexing HTML documents: title, meta and body ● Has support for different field types: Keyword (not tokenized), UnIndexed (not indexed), Binary (binary data), Text (tokenized), UnStored (tokenized and indexed, but not stored)
  • 23. Zend Lucene: Indexing Example
       Normal:
       <?php
       // Open existing index
       $index = Zend_Search_Lucene::open('/data/my-index');
       $doc = new Zend_Search_Lucene_Document();
       // Store document URL to identify it in search result.
       $doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
       // Index document content
       $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));
       // Add document to the index.
       $index->addDocument($doc);
       ?>
       HTML document:
       <?php
       $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
       $index->addDocument($doc);
       ?>
  • 24. Zend Lucene: Search Example
       With parser:
       <?php
       $index = Zend_Search_Lucene::open('/data/my-index');
       $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
       $hits = $index->find($userQuery);
       ?>
       With API:
       <?php
       $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
       $pathTerm = new Zend_Search_Lucene_Index_Term('/data/doc_dir/' . $filename, 'path');
       $pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
       $query = new Zend_Search_Lucene_Search_Query_Boolean();
       $query->addSubquery($userQuery);
       $query->addSubquery($pathQuery);
       $hits = $index->find($query);
       ?>
  • 25. Apache Solr Solr is a standalone enterprise search server with a web-services-like API. It extends Lucene: ● Real schema, with numeric types, dynamic fields, unique keys ● Powerful extensions to the Lucene query language ● Support for dynamic faceted browsing and filtering ● Advanced, configurable text analysis ● Highly configurable and user extensible caching ● Performance optimizations ● External configuration via XML ● An administration interface ● Monitorable logging ● Fast incremental updates and snapshot distribution ● XML and CSV/delimited-text update formats Demo: http://localhost:8983/solr/admin/
  • 26. Marjory Marjory is a webservice for indexing and searching for documents, utilizing a full-text search engine. ● It is somewhat similar to Solr, but is written in PHP and the underlying architecture allows for using search engines other than Lucene (no other adaptor is implemented yet, though). ● Marjory is based on the Zend Framework and uses Zend_Search_Lucene as the default search engine.
  • 27. eZ Components' Search component ● As our current solution doesn't scale, we needed something new ● As Lucene is really good, the first attempt was to use the Java bridge ● Then Solr came out ● Which we integrated in "eZ Find" ● Next versions of eZ Publish will use eZ Components
  • 28. eZ Components' Search component: Requirements and Design ● Support for multiple backends ● Abstract documents ● Support for datatypes ● Rich searching API, including faceted search ● Easy query interface
  • 29. eZ Components' Search component: Document Definition
       <?php
       static public function getDefinition()
       {
           $n = new ezcSearchDocumentDefinition( 'ezcSearchSimpleArticle' );
           $n->idProperty = 'id';
           $n->fields['id'] = new ezcSearchDefinitionDocumentField( 'id', ezcSearchDocumentDefinition::TEXT );
           $n->fields['title'] = new ezcSearchDefinitionDocumentField( 'title', ezcSearchDocumentDefinition::TEXT, 2, true, false, true );
           $n->fields['body'] = new ezcSearchDefinitionDocumentField( 'body', ezcSearchDocumentDefinition::TEXT, 1, false, false, true );
           $n->fields['published'] = new ezcSearchDefinitionDocumentField( 'published', ezcSearchDocumentDefinition::DATE );
           $n->fields['url'] = new ezcSearchDefinitionDocumentField( 'url', ezcSearchDocumentDefinition::STRING );
           $n->fields['type'] = new ezcSearchDefinitionDocumentField( 'type', ezcSearchDocumentDefinition::STRING, 0, true, false, false );
           return $n;
       }
       ?>
  • 30. eZ Components' Search component: Document Definition, in XML
       <?xml version="1.0"?>
       <document>
           <field type="id">id</field>
           <field type="text" boost="2">title</field>
           <field type="text">summary</field>
           <field inResult="false" type="html">body</field>
           <field type="date">published</field>
           <field type="string" multi="true">author</field>
       </document>
       Setting up the manager:
       <?php
       $backend = new ezcSearchSolrHandler;
       $session = new ezcSearchSession( $backend, new ezcSearchXmlManager( $testFilesDir ) );
       ?>
  • 31. eZ Components' Search component: Indexing
       <?php
       $session = new ezcSearchSession( $backend, new ezcSearchXmlManager( $testFilesDir ) );
       $a = new Article(
           null,                         // id
           'Test Article',               // title
           'This is an article to test', // description
           'the body of the article',    // body
           time()                        // published
       );
       $session->index( $a );
       ?>
  • 32. eZ Components' Search component: Search - API
       <?php
       $session = new ezcSearchSession( $backend, new ezcSearchXmlManager( $testFilesDir ) );
       $q = $session->createFindQuery( 'Article' );
       $q->where( $q->eq( 'title', 'Article' ) )
         ->limit( 5 )
         ->orderBy( 'id' );
       $r = $session->find( $q );
       foreach ( $r->documents as $document )
       {
           echo $document['document']->title, "\n";
       }
       ?>
  • 33. eZ Components' Search component: Search - Query Builder
       <?php
       $session = new ezcSearchSession( $backend, new ezcSearchXmlManager( $testFilesDir ) );
       $q = $session->createFindQuery( 'Article' );
       new ezcSearchQueryBuilder( $q, 'thunderball', array( 'fieldOne', 'fieldTwo' ) );
       $q->facet( 'title' ); // keyword data field
       $r = $session->find( $q );
       foreach ( $r->documents as $document )
       {
           echo $document['document']->title, "\n";
       }
       ?>
  • 34. Notes on Performance: Solr vs Zend Lucene ● Neither engine was tuned ● Solr indexes about 25% faster for small documents (one sentence) ● Solr indexes about 200% faster for big documents (64 KB)
  • 35. Future improvements ● More backends: Google, Marjory, Sphinx, Xapian, Yahoo ● More features: SpellChecker, MoreLikeThis
  • 36. Resources
       Porter algorithm: http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html
       Solr: http://lucene.apache.org/solr/
       Zend Lucene: http://framework.zend.com/manual/en/zend.search.lucene.html
       Snowball: http://snowball.tartarus.org/
       Mecab (in Japanese): http://mecab.sourceforge.net/
       These slides: http://derickrethans.nl/talks.php