Can you be dynamic and 
         fast?
 “Miss Marple and the case of the Missing MIPS”

                  Zoë Slattery
Agenda


●   Index and search applications


●   The problem for PHP programmers


●   Understanding execution times


●   Conclusions
Index and search


●   Problem of finding relevant information is not new.
     – 3000 years BC [1]
     – Vannevar Bush, As We May Think, 1945.



●   Today applications that search the Web must be able to provide instant 
    access to > 10 billion documents


●   Many applications need some form of search, e.g. searching your hard 
    drive, email....


            1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 
            16-18, 2005.
Options for information retrieval
●   Search engines
     – Nutch, SearchBlox.....

●   Information Retrieval libraries
     – Three with broadly similar features



               Implementation   Language                        Language
                 language       bindings                        ports                License

     Egothor       Java         None                            None                 BSD like
     Xapian        C++          Perl, Python, PHP, Java, TCL    None                 GPL
     Lucene        Java         None                            C++, Perl, PHP, C#   Apache 2
Lucene [2]

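
[Diagram: the application gathers data from a DB, the Web and the file system, and uses Lucene to index the documents. It then gets the user's query, uses Lucene to search the index, and presents the search results to the user.]
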

                       2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
Lucene indexing

1. Documents  --(Analysis)-->  2. Token stream  --(Index creation)-->  3. Inverted index  --(Optimise)-->  4. Optimised inverted index

1. Documents
       Oh for a muse of fire that would ascend the brightest heaven of invention.....

2. Token stream (each token has a start and an end)
       [fire]   [ascend]  [bright]  [heaven]

3. Inverted index
       Terms      Documents
       fire       Henry V, Scouting for boys...
       ascend     Aerospace, Henry V...
       ...

4. Optimised inverted index
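
The indexing steps above can be sketched in a few lines of PHP. This is a toy illustration, not Lucene's data structure: the function names, the stop-word list and the second document are all made up, and there is no stemming (so "brightest" stays "brightest" rather than becoming "bright").

```php
<?php
// Hypothetical sketch: a toy analyser and inverted index, not Lucene's real structures.

/** Split text into lower-cased word tokens and drop an assumed stop-word list. */
function analyse(string $text): array
{
    $stopWords = ['a', 'an', 'for', 'of', 'oh', 'that', 'the', 'would'];
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return array_values(array_diff($matches[0], $stopWords));
}

/** Map each term to the names of the documents that contain it (insertion order). */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        foreach (array_unique(analyse($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep the terms in alphabetical order
    return $index;
}

// Invented sample documents, loosely following the slide.
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend on a column of fire',
];

$index = buildInvertedIndex($documents);
print_r($index['fire']);
```

Looking up a term is then a single array access, which is why search only has to touch the index and not the documents.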
Agenda


●   Index and search applications


●   The problem for PHP programmers


●   Understanding execution times


●   Conclusions
Indexing speed

Benchmark:
●17.4 MB, 814 files of PHP source code

●Linux/Thinkpad T60




                     Time to index   Time to optimise   Total time
                       /seconds         /seconds         /seconds
        PHP              167               43               210

       Java               32                3                35

     Java + JIT           4                0.3               4.3




                                          Ouch! nearly 50 times as fast in Java
Why is the performance so 
           bad?

First make sure we are comparing the same thing:


 ➢   Analyser
     ➢   Java Lucene has many analysers

 ➢   Limits on terms
     ➢   Java stops looking after 10,000 terms

 ➢   Scoring
     ➢   Java rounds down, PHP rounds to closest

 ➢   Compare indexes using Luke
Analysis - Java
Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
StandardAnalyzer:
   [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

SimpleAnalyzer:
   [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

StopAnalyzer:
   [quick] [brown] [fox] [jumped] [over] [lazy] [dog]


Analyzing "XY&Z Corporation - xyz@example.com"
StandardAnalyzer:
   [xy&z] [corporation] [xyz@example.com]

SimpleAnalyzer:
   [xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer:
   [xy] [z] [corporation] [xyz] [example] [com]
Analysis - PHP
Analysing "A Quick Brown Fox jumped over the Lazy Dog"
Default (lower case) filter:
[a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Stop words filter:
[quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Short words filter:
[quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]


Analysing "XY&Z Corporation - xyz@example.com"
Default (lower case) filter:
[xy] [z] [corporation] [xyz] [example] [com]

Stop words filter:
[xy] [z] [corporation] [xyz] [example] [com]

Short words filter:
[xy] [corporation] [xyz] [example] [com]
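
A rough sketch of how the three PHP filters could produce the token lists above. The helper names are invented for illustration and are not the Zend Search Lucene API; the stop-word list and the minimum length of 2 are assumptions chosen to match the output shown on the slide.

```php
<?php
// Hypothetical sketch of the three filters; not the Zend Search Lucene classes.

/** Default analysis: split on non-letters and lower-case every token. */
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Drop tokens found in the stop-word list (list assumed for illustration). */
function stopWordsFilter(array $tokens, array $stopWords = ['a', 'the']): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Drop tokens shorter than $minLength characters (threshold assumed to be 2). */
function shortWordsFilter(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter(
        $tokens,
        fn(string $token): bool => strlen($token) >= $minLength
    ));
}

$tokens = lowerCaseTokens('A Quick Brown Fox jumped over the Lazy Dog');
echo implode(' ', stopWordsFilter($tokens)), PHP_EOL;
echo implode(' ', shortWordsFilter($tokens)), PHP_EOL;
```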
Compare indexes

                         java

        Same 663 terms




                                php
Agenda


●   Index and search applications


●   The problem for PHP programmers


●   Understanding execution times
     – Part one
     – Part two


●   Conclusions
Execution profiles
●   Now that we are definitely comparing the same thing, look at
    execution profiles for Java and PHP implementations


●   Profiling tools (all open source)

     –   Java
           ● Eclipse TPTP




     –   PHP
          ● Xdebug

          ● KCachegrind




     –   System
          ● Sysprof

          ● vmstat, iostat
Java profile
Small problems with TPTP...

Benchmark data:
 ● 39 files of PHP source code (php/Zend), 1.2 MB 




                      Time to index     Time to optimise   % time in indexing
                        /seconds           /seconds
    Java + profile       687258              673851               50

        Java               2.3                 0.3                88


●Invasive and slow. Takes 600,000 times as long to execute
●Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)

●Output file is machine readable only




But – it's free, open source and it works well enough.
PHP profile
No problems with this tool

Benchmark data:
 ● 39 files of PHP source code (php/Zend), 1.2 MB 




                      Time to index       Time to optimise   % time in indexing
                        /seconds             /seconds
 PHP + profile              70                   55                   56

       PHP                   5                    3                   63


   ●Not as invasive as the Java tool, but still adds to the time and distorts it slightly
   ●Results easy to display with KCachegrind

   ●Output file is readable
The normalize() function




   Sum( ) = 2.92;  

  18.99 – 2.92 = 16.07 
Micro benchmark


<?php 
        require_once "Token.php"; 
        require_once "LowerCase.php"; 

        $token = new Token("GO", 105, 107); 
        $filter = new LowerCase(); 

        for ($i=0; $i < 10000000; $i++) { 
                $norm_token = $filter->normalize($token); 
        } 
?> 
normalize() opcodes
compiled vars:  !0 = $srcToken, !1 = $newToken 
line     #  op                       ext  return   operands 
---------------------------------------------------------------------------- 
11     0  RECV                       1 
13     1  ZEND_FETCH_CLASS                  :0      'Token' 
       2  NEW                               $1      :0 
       3  ZEND_INIT_METHOD_CALL                     !0, 'getTermText' 
       4  DO_FCALL_BY_NAME           0 
       5  SEND_VAR_NO_REF                           $3 
       6  DO_FCALL                   1              'strtolower' 
       7  SEND_VAR_NO_REF                           $4 
14     8  ZEND_INIT_METHOD_CALL                     !0, 'getStartOffset' 
       9  DO_FCALL_BY_NAME           0 
      10  SEND_VAR_NO_REF                           $6 
15    11  ZEND_INIT_METHOD_CALL                     !0, 'getEndOffset' 
      12  DO_FCALL_BY_NAME           0 
      13  SEND_VAR_NO_REF                           $8 
      14  DO_FCALL_BY_NAME           3 
      15  ASSIGN                                    !1, $1 
16    ......
System profile



       1. Convert to lower case
       2. Look up opcodes
How Xdebug works

Script execution:

ZEND_INIT_METHOD_CALL
     ● Convert function name to lower case
     ● Look up function in function table

DO_FCALL_BY_NAME
     ● Call out to profiler – start time
     ● Execute function
     ● Call out to profiler – end time
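
The start time / execute / end time pattern can be imitated in plain PHP. This is only a user-land sketch of the idea: Xdebug does this inside the engine, and the CallTimer class and its methods are invented for this example.

```php
<?php
// Hypothetical user-land sketch; Xdebug works inside the PHP engine, not like this.

/** Records how long each named call takes, in seconds. */
class CallTimer
{
    private $timings = [];

    /** Note the start time, execute the function, then note the end time. */
    public function call(string $name, callable $function, ...$args)
    {
        $start  = microtime(true);                            // profiler: start time
        $result = $function(...$args);                        // execute function
        $this->timings[$name][] = microtime(true) - $start;   // profiler: end time
        return $result;
    }

    /** Number of recorded calls for $name. */
    public function callCount(string $name): int
    {
        return count($this->timings[$name] ?? []);
    }

    /** Total recorded time for $name, in seconds. */
    public function totalTime(string $name): float
    {
        return array_sum($this->timings[$name] ?? []);
    }
}

$timer = new CallTimer();
for ($i = 0; $i < 1000; $i++) {
    $lower = $timer->call('strtolower', 'strtolower', 'GO');
}

printf(
    "strtolower: %d calls, %.6f s in total\n",
    $timer->callCount('strtolower'),
    $timer->totalTime('strtolower')
);
```

Note that the two timer calls wrap every function call, so a script made of many small functions pays the measuring cost many times over. That is one reason profiled runs are slower and slightly distorted.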
The normalize() function




   Sum( ) = 2.92;         Is consumed in setting up 
                          functions to be run
  18.99 – 2.92 = 16.07 
Why is function calling faster in 
             Java?
●   Java is a static language. VM structures are known at start up –
    can't add code on the fly, types are known at compile time.

●   First time a function is called Java caches a reference to it in a
    virtual dispatch table. After that function calls are fast.

●   In PHP, code can be added during execution, for example,
    create_function() and types are not known till code is executed.
    This makes keeping virtual dispatch tables much more difficult.
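
A small illustration of code being added during execution. It uses eval() rather than create_function(), because create_function() was removed in PHP 8.0; the function name shout() is made up for the example.

```php
<?php
// Illustration only: PHP can define new functions while the script is running.

$before = function_exists('shout');   // false: nothing named shout() exists yet

// eval() compiles the string and registers a new function at run time.
eval('function shout(string $s): string { return strtoupper($s) . "!"; }');

$after = function_exists('shout');    // true: the function table has grown

// The function to call is looked up by name at the moment of the call.
$name   = 'shout';
$result = $name('fast');

var_dump($before, $after, $result);
```

Because the set of functions can change like this at any point, the engine cannot fix a table of call targets before the script starts running.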
Agenda


●   Index and search applications


●   The problem for PHP programmers


●   Understanding execution times
     – Part one
     – Part two


●   Conclusions
PHP profile
look at the call to normalize()
$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));




public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());

    $newToken->setPositionIncrement($srcToken->getPositionIncrement());

    return $newToken;
}
look at the call to normalize()
$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));




normalize() recoded....

public function normalize(Token $srcToken)
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}
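
To try the two versions side by side, here is a self-contained sketch. The Token class is a cut-down stand-in for Zend_Search_Lucene_Analysis_Token, the function names normalizeCopy() and normalizeInPlace() are invented, and the iteration count is arbitrary. Timings will vary by machine and PHP version, so none are shown.

```php
<?php
// Hypothetical, cut-down Token; the real class is Zend_Search_Lucene_Analysis_Token.
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct($termText, $startOffset, $endOffset)
    {
        $this->termText    = $termText;
        $this->startOffset = $startOffset;
        $this->endOffset   = $endOffset;
    }

    public function getTermText()           { return $this->termText; }
    public function setTermText($text)      { $this->termText = $text; }
    public function getStartOffset()        { return $this->startOffset; }
    public function getEndOffset()          { return $this->endOffset; }
    public function getPositionIncrement()  { return $this->positionIncrement; }
    public function setPositionIncrement($n) { $this->positionIncrement = $n; }
}

// Original version: five method calls plus a constructor call per token.
function normalizeCopy(Token $srcToken): Token
{
    $newToken = new Token(
        strtolower($srcToken->getTermText()),
        $srcToken->getStartOffset(),
        $srcToken->getEndOffset()
    );
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}

// Recoded version: two method calls, no new object; the caller's token is modified.
function normalizeInPlace(Token $srcToken): Token
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}

/** Time $iterations calls of $normalize, each on a fresh token as at the call site. */
function timeNormalize(callable $normalize, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $normalize(new Token('GO', 105, 107));
    }
    return microtime(true) - $start;
}

$copy    = normalizeCopy(new Token('GO', 105, 107));
$inPlace = normalizeInPlace(new Token('GO', 105, 107));
printf("copy: %s, in place: %s\n", $copy->getTermText(), $inPlace->getTermText());

$iterations = 100000; // arbitrary; small enough to keep the run short
printf("copy:     %.3f s\n", timeNormalize('normalizeCopy', $iterations));
printf("in place: %.3f s\n", timeNormalize('normalizeInPlace', $iterations));
```

The in-place version is only safe because the call site passes a freshly created token that nothing else holds on to.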
After fix
Performance improvement?



              Time to index   Time to optimise   Total time
                /seconds         /seconds         /seconds
   PHP            167               43               210

 PHP + fix        151               43               194

   Java            32                3                35

 Java + JIT        4                0.3              4.3

                          About 9.5 % improvement in time to index (167 s to 151 s)
Agenda


●   Index and search applications


●   The problem for PHP programmers


●   Understanding execution times
     – Part one
     – Part two


●   Conclusions
Conclusions

●   Two reasons why the PHP implementation of Lucene is slow:
      –   Function calling overhead in PHP
      –   Inefficient code in the analyser [3]
      –   These are the main two, there are others....



●   Dynamic and fast?
      –   Hard to get to the same execution speed as Java – but possible to get closer.
      –   But development speed is much better [4] – what speed do you care about?
      –   Better not to use Java coding style (lots of methods that do nothing)



●   So which implementation of Lucene should I use?
      –   it depends.....



3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
Options for PHP

Do you care about speed?
     Y: Can support Java environment?
          Y: Use a Web Service?
               Y: Use SOLR as web service
               N: Use Lucene via a Java bridge
          N: No Lucene solution today [5]
     N: Only need basic features?
          Y: Use Zend Search Lucene
          N: No Lucene solution today [5]


                                              5. http://pecl.php.net/package/clucene
Acknowledgements


●   Rob Young's presentation [6] to the London PHP user group.


●   Members of the PHP internals community, in particular Scott MacVicar, 
    Derick Rethans and Dmitry Stogov.




                          6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)
Other useful links

●http://www.egothor.org/
●http://xapian.org/
●http://lucene.apache.org/

●http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

●http://www.derickrethans.nl/vld.php

●http://lucene.apache.org/nutch/

●http://www.searchblox.com/

●http://www.xdebug.org/

●http://www.eclipse.org/tptp/

●http://www.getopt.org/luke/

●http://www.projectzero.org

●http://www.ibm.com/developerworks/ (Publication due 24/09/08)

●http://php-java-bridge.sourceforge.net/doc/

●http://www.zend.com/en/products/platform/product-comparison/java-bridge

●http://lucene.apache.org/solr/

●http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html

Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 

Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

  • 1. Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS” Zoë Slattery
  • 2. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times ● Conclusions
  • 3. Index and search
      ● The problem of finding relevant information is not new.
          – 3000 BC [1]
          – Vannevar Bush, As We May Think, 1945.
      ● Today, applications that search the Web must be able to provide instant access to more than 10 billion documents.
      ● Many applications need some form of search, e.g. searching your hard drive or email.
      1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing, Volume 9(3), 16-18, 2005.
  • 4. Options for information retrieval
      ● Search engines
          – Nutch, SearchBlox.....
      ● Information retrieval libraries
          – Three with broadly similar features

                  Implementation language   Language bindings              Language ports       License
        Egothor   Java                      None                           None                 BSD-like
        Xapian    C++                       Perl, Python, PHP, Java, TCL   None                 GPL
        Lucene    Java                      None                           C++, Perl, PHP, C#   Apache 2
  • 5. Lucene [2]
      [Diagram: the application gathers data (DB, Web, file system), gets the user query and presents search results to the user; Lucene indexes the documents, holds the index and searches the index.]
      2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich, 2005.
  • 6. Lucene indexing
      1. Documents: "Oh for a muse of fire that would ascend the brightest heaven of invention....."
      2. Token stream (after analysis): [fire] [ascend] [bright] [heaven], each token with a start and end position
      3. Inverted index (after index creation):

           Terms    Documents
           fire     Henry V, Scouting for boys...
           ascend   Aerospace, Henry V...
           ...

      4. Optimised inverted index (after optimise)
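
The steps on this slide (documents, token stream, inverted index) can be sketched in a few lines of PHP. This is only a minimal illustration, not Zend Search Lucene code: the helper names tokenize() and buildInvertedIndex() are hypothetical, the "Aerospace" text is made up for the example, and no stemming, stop-word removal or optimisation is attempted.

```php
<?php
// Minimal inverted-index sketch (hypothetical helper names, not the Zend API).

/** Split text into lower-case, letters-only tokens. */
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return $matches[0];
}

/** Map each term to the list of document names that contain it. */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        // array_unique() so a document is listed once per term.
        foreach (array_unique(tokenize($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep terms in alphabetical order
    return $index;
}

$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend',
    'Aerospace' => 'Rockets ascend quickly', // invented sample text
];
$index = buildInvertedIndex($documents);

// Searching is now a lookup by term rather than a scan of every document.
print_r($index['ascend']);
```

A search for a term is then an array lookup, which is why the index is built once up front and queried many times.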
  • 7. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times ● Conclusions
  • 8. Indexing speed
      Benchmark:
      ● 17.4 MB, 814 files of PHP source code
      ● Linux / ThinkPad T60

                     Time to index /seconds   Time to optimise /seconds   Total time /seconds
        PHP          167                      43                          210
        Java         32                       3                           35
        Java + JIT   4                        0.3                         4.3

      Ouch! Nearly 50 times as fast in Java.
  • 9. Why is the performance so bad?
      First make sure we are comparing the same thing:
      ➢ Analyser
          ➢ Java Lucene has many analysers
      ➢ Limits on terms
          ➢ Java stops looking at 10,000 terms
      ➢ Scoring
          ➢ Java rounds down, PHP rounds to closest
      ➢ Compare indexes using Luke
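
To see why these settings matter before timing anything, here is a small illustration of how a term limit and a rounding rule each change what ends up in an index. It is a sketch under stated assumptions, not Lucene's actual scoring code: the limit of 5 stands in for the 10,000-term limit on the slide, and the token list and score are made-up values.

```php
<?php
// Illustration only: a term limit and a rounding rule both change the result.

$maxTerms = 5;                                   // stands in for the 10,000 limit
$tokens   = ['a', 'b', 'c', 'd', 'e', 'f', 'g']; // made-up token stream
$indexed  = array_slice($tokens, 0, $maxTerms);  // terms past the limit are ignored

$score       = 2.7;                              // made-up value
$roundedDown = (int) floor($score);              // "rounds down" behaviour
$nearest     = (int) round($score);              // "rounds to closest" behaviour

echo count($indexed), ' terms indexed; ', $roundedDown, ' vs ', $nearest, "\n";
```

Two implementations that differ in either respect produce different indexes from the same input, so their timings are not directly comparable until the settings are aligned.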
  • 10. Analysis - Java
      Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
        StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        SimpleAnalyzer:   [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        StopAnalyzer:     [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
      Analyzing "XY&Z Corporation - xyz@example.com"
        StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
        SimpleAnalyzer:   [xy] [z] [corporation] [xyz] [example] [com]
        StopAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
  • 11. Analysis - PHP
      Analysing "A Quick Brown Fox jumped over the Lazy Dog"
        Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
        Stop words filter:           [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        Short words filter:          [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
      Analysing "XY&Z Corporation - xyz@example.com"
        Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
        Stop words filter:           [xy] [z] [corporation] [xyz] [example] [com]
        Short words filter:          [xy] [corporation] [xyz] [example] [com]
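
The three PHP filters above can be imitated with plain functions. This is a sketch for illustration, not the Zend Search Lucene analyser classes: the function names, the stop-word list ['a', 'the'] and the minimum length of 2 are assumptions chosen so that the output matches the slide.

```php
<?php
// Sketch of the three filters from the slide (hypothetical function names).

/** Default analysis: runs of letters only, lower-cased. */
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Drop tokens that appear in the stop-word list. */
function stopWordsFilter(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Drop tokens shorter than $minLength characters. */
function shortWordsFilter(array $tokens, int $minLength): array
{
    return array_values(array_filter(
        $tokens,
        fn (string $token): bool => strlen($token) >= $minLength
    ));
}

$corp = lowerCaseTokens('XY&Z Corporation - xyz@example.com');
echo implode(' ', $corp), "\n";                      // xy z corporation xyz example com
echo implode(' ', shortWordsFilter($corp, 2)), "\n"; // xy corporation xyz example com

$fox = lowerCaseTokens('A Quick Brown Fox jumped over the Lazy Dog');
echo implode(' ', stopWordsFilter($fox, ['a', 'the'])), "\n";
```

Note that, unlike Java's StandardAnalyzer on the previous slide, splitting on letters alone breaks "xy&z" and the e-mail address into separate tokens.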
  • 12. Compare indexes
      [Screenshots: the Java index and the PHP index contain the same 663 terms.]
  • 13. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 14. Execution profiles
      ● Now that we are definitely comparing the same thing, look at execution profiles for the Java and PHP implementations
      ● Profiling tools (all open source)
          – Java: Eclipse TPTP
          – PHP: Xdebug, KCachegrind
          – System: Sysprof, vmstat, iostat
  • 16. Small problems with TPTP...
      Benchmark data:
      ● 39 files of PHP source code (php/Zend), 1.2 MB

                         Time to index /seconds   Time to optimise /seconds   % time in indexing
        Java + profile   687258                   673851                      50
        Java             2.3                      0.3                         88

      ● Invasive and slow. Takes 600,000 times as long to execute
      ● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
      ● Output file is machine readable only
      But it's free, open source and it works well enough.
  • 18. No problems with this tool
      Benchmark data:
      ● 39 files of PHP source code (php/Zend), 1.2 MB

                        Time to index /seconds   Time to optimise /seconds   % time in indexing
        PHP + profile   70                       55                          56
        PHP             5                        3                           63

      ● Not as invasive as the Java tool, but still adds to the time and distorts the results slightly
      ● Results easy to display with KCachegrind
      ● Output file is readable
  • 19. The normalize() function: Sum( ) = 2.92; 18.99 - 2.92 = 16.07
  • 21. normalize() opcodes
      compiled vars:  !0 = $srcToken, !1 = $newToken
      line   #   op (followed by ext, return, operands)
      ----------------------------------------------------------
      11     0   RECV 1
      13     1   ZEND_FETCH_CLASS :0 'Token'
             2   NEW $1 :0
             3   ZEND_INIT_METHOD_CALL !0, 'getTermText'
             4   DO_FCALL_BY_NAME 0
             5   SEND_VAR_NO_REF $3
             6   DO_FCALL 1 'strtolower'
             7   SEND_VAR_NO_REF $4
      14     8   ZEND_INIT_METHOD_CALL !0, 'getStartOffset'
             9   DO_FCALL_BY_NAME 0
            10   SEND_VAR_NO_REF $6
      15    11   ZEND_INIT_METHOD_CALL !0, 'getEndOffset'
            12   DO_FCALL_BY_NAME 0
            13   SEND_VAR_NO_REF $8
            14   DO_FCALL_BY_NAME 3
            15   ASSIGN !1, $1
      16    ......
  • 22. System profile 1. Convert to lower case 2. Look up opcodes
  • 23. How Xdebug works
      Script execution:
      ● ZEND_INIT_METHOD_CALL
          – Convert function name to lower case
          – Look up function in function table
      ● DO_FCALL_BY_NAME
          – Call out to profiler: start time
          – Execute function
          – Call out to profiler: end time
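
The sequence on this slide can be modelled in userland PHP to make the per-call work visible. This is a toy model, not Zend Engine or Xdebug code: callByName(), $functionTable and $profile are hypothetical names, and microtime() stands in for the profiler's own clock.

```php
<?php
// Toy model of the call sequence on the slide; not Zend Engine or Xdebug code.

$functionTable = [
    'gettermtext' => fn (): string => 'Fire', // keys are stored in lower case
];
$profile = []; // function name => accumulated seconds

function callByName(array $functionTable, array &$profile, string $name, ...$args)
{
    // ZEND_INIT_METHOD_CALL: lower-case the name, then look it up.
    $key = strtolower($name);
    if (!isset($functionTable[$key])) {
        throw new RuntimeException("Call to undefined function $name()");
    }
    $function = $functionTable[$key];

    // DO_FCALL_BY_NAME: profiler start, execute function, profiler end.
    $start  = microtime(true);
    $result = $function(...$args);
    $profile[$key] = ($profile[$key] ?? 0.0) + (microtime(true) - $start);

    return $result;
}

echo callByName($functionTable, $profile, 'getTermText'), "\n"; // Fire
```

Every call repeats the lower-casing and the table lookup before any useful work happens, which is the set-up cost the profile attributes to the caller.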
  • 24. The normalize() function: Sum( ) = 2.92; 18.99 - 2.92 = 16.07 is consumed in setting up functions to be run
  • 25. Why is function calling faster in Java?
      ● Java is a static language. VM structures are known at start-up: code can't be added on the fly, and types are known at compile time.
      ● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
      ● In PHP, code can be added during execution (for example, with create_function()), and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
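
A short illustration of the last point: in PHP the set of callable functions can change while the script runs, so a call site cannot simply be bound once and reused. The function name shout() is made up for the example, and eval() is used here because create_function(), mentioned on the slide, was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Illustration: PHP's function table can grow while the script is running.

$before = function_exists('shout'); // false: nothing by that name yet

// Code arrives as a string at run time and is compiled on the spot.
eval('function shout(string $s): string { return strtoupper($s) . "!"; }');

$after = function_exists('shout');  // true: the function table has changed

// The callee is chosen by a string that is only known at run time.
$name = 'sho' . 'ut';
echo $name('fast'), "\n";           // FAST!
```

Because both the function table and the name being called are run-time values, the engine falls back to a lookup by name on each call.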
  • 26. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 28. Look at the call to normalize()
      $token = $this->normalize(
          new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

      public function normalize(Token $srcToken)
      {
          $newToken = new Token(strtolower($srcToken->getTermText()),
                                $srcToken->getStartOffset(),
                                $srcToken->getEndOffset());
          $newToken->setPositionIncrement($srcToken->getPositionIncrement());
          return $newToken;
      }
  • 29. Look at the call to normalize()
      $token = $this->normalize(
          new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

      normalize() recoded....

      public function normalize(Token $srcToken)
      {
          $srcToken->setTermText(strtolower($srcToken->getTermText()));
          return $srcToken;
      }
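
The difference between the two versions can be tried out with a stand-in class. This is a sketch under assumptions: the Token class below is a hypothetical, much simplified substitute for Zend_Search_Lucene_Analysis_Token (position increments are left out), and the object counter exists only to make the saving visible.

```php
<?php
// Stand-in Token class (hypothetical, far simpler than the Zend class).
class Token
{
    public static $constructed = 0; // counts how many Token objects were built

    private $termText;
    private $startOffset;
    private $endOffset;

    public function __construct(string $termText, int $startOffset, int $endOffset)
    {
        $this->termText    = $termText;
        $this->startOffset = $startOffset;
        $this->endOffset   = $endOffset;
        self::$constructed++;
    }

    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
}

/** Original style: build a second Token, copying each field through a getter. */
function normalizeCopy(Token $srcToken): Token
{
    return new Token(strtolower($srcToken->getTermText()),
                     $srcToken->getStartOffset(),
                     $srcToken->getEndOffset());
}

/** Recoded style: lower-case the text in place and return the same object. */
function normalizeInPlace(Token $srcToken): Token
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}

$a = normalizeCopy(new Token('Fire', 0, 4));    // builds two Token objects
$b = normalizeInPlace(new Token('Fire', 0, 4)); // builds one Token object
echo $a->getTermText(), ' ', $b->getTermText(), ' ', Token::$constructed, "\n";
```

Both versions yield the same term text; the in-place version avoids one constructor call and two getter calls per token, and each call avoided also avoids the set-up cost discussed earlier.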
  • 31. Performance improvement?

                     Time to index /seconds   Time to optimise /seconds   Total time /seconds
        PHP          167                      43                          210
        PHP + fix    151                      43                          194
        Java         32                       3                           35
        Java + JIT   4                        0.3                         4.3

      9.5 % improvement (time to index: 167 s down to 151 s)
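
The percentages behind this table are easy to recompute. The sketch below uses only the figures from the slide; the helper name improvement() is made up for the example. The slide's 9.5 % corresponds to the indexing column (16 s saved out of 167 s); measured on total time the gain is smaller, and the gap to Java + JIT remains large.

```php
<?php
// Figures taken from the table above (seconds).
$php          = ['index' => 167, 'optimise' => 43];
$phpFix       = ['index' => 151, 'optimise' => 43];
$javaJitTotal = 4.3;

/** Percentage reduction going from $before to $after. */
function improvement(float $before, float $after): float
{
    return ($before - $after) / $before * 100;
}

$indexGain = improvement($php['index'], $phpFix['index']);     // indexing only
$totalGain = improvement(array_sum($php), array_sum($phpFix)); // index + optimise
$gap       = array_sum($phpFix) / $javaJitTotal;               // PHP + fix vs Java + JIT

printf("index: %.1f%%, total: %.1f%%, still %.0f times slower than Java + JIT\n",
       $indexGain, $totalGain, $gap);
```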
  • 32. Agenda ● Index and search applications ● The problem for PHP programmers ● Understanding execution times – Part one – Part two ● Conclusions
  • 33. Conclusions
      ● Two reasons why the PHP implementation of Lucene is slow:
          – Function calling overhead in PHP
          – Inefficient code in the analyser [3]
          – These are the main two; there are others....
      ● Dynamic and fast?
          – Hard to get to the same execution speed as Java, but possible to get closer.
          – But development speed is much better [4]. What speed do you care about?
          – Better not to use Java coding style (lots of methods that do nothing)
      ● So which implementation of Lucene should I use?
          – It depends.....
      3. http://framework.zend.com/issues/browse/ZF-3683
      4. Prechelt, L. An empirical comparison of seven programming languages. Computer, Volume 33(10), 23-29, 2000.
  • 34. Options for PHP
      ● Do you care about speed?
          – Y: Can you support a Java environment?
              – Y: Use a Web Service?
                  – Y: Use SOLR as a web service
                  – N: Use Lucene via a Java bridge
              – N: No Lucene solution today [5]
          – N: Only need basic features?
              – Y: Use Zend Search Lucene
              – N: No Lucene solution today [5]
      5. http://pecl.php.net/package/clucene
  • 35. Acknowledgements
      ● Rob Young's presentation [6] to the London PHP user group.
      ● Members of the PHP internals community, in particular Scott MacVicar, Derick Rethans and Dmitry Stogov.
      6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)