The document discusses a presentation about improving indexing speed for PHP applications. It begins by introducing common information retrieval libraries like Lucene, and compares the indexing speed of PHP and Java implementations of Lucene. Profiling tools show that the PHP implementation takes much longer than Java, largely due to overhead from PHP's dynamic nature which makes function calls slower compared to Java's optimized approach using virtual dispatch tables. The document explores reasons for the performance differences and aims to help PHP programmers understand execution times.
2. Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
● Conclusions
3. Index and search
● Problem of finding relevant information is not new.
– 3000 years BC [1]
– Vannevar Bush, As We May Think, 1945.
● Today, applications that search the Web must be able to provide instant access to more than 10 billion documents
● Many applications need some form of search, e.g. searching your hard drive, email....
1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
4. Options for information retrieval
● Search engines
– Nutch, SearchBlox.....
● Information Retrieval libraries
– Three with broadly similar features
Library   Implementation language   Language bindings              Language ports       License
Egothor   Java                      None                           None                 BSD like
Xapian    C++                       Perl, Python, PHP, Java, TCL   None                 GPL
Lucene    Java                      None                           C++, Perl, PHP, C#   Apache 2
5. Lucene [2]
[Figure: a Lucene-based application. The application gathers data (DB, Web, file system) and indexes documents into the Lucene index. It gets the user query, searches the index, and presents search results to the user.]
2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
6. Lucene indexing
1. Documents
   "Oh for a muse of fire that would ascend the brightest heaven of invention....."
– Analysis
2. Token stream
   [fire] [ascend] [bright] [heaven] (each token records its start and end offsets)
– Index creation
3. Inverted index
   Terms     Documents
   fire      Henry V, Scouting for boys...
   ascend    Aerospace, Henry V...
   ...
– Optimise
4. Optimised inverted index
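The analysis and index-creation steps above can be sketched in a few lines of PHP. This is a minimal illustration only: the function names `analyse()` and `buildInvertedIndex()`, the stop-word list and the two sample documents are assumptions made for this sketch, not part of Lucene or Zend Search Lucene, and no stemming is done (so "brightest" is not reduced to "bright" as on the slide).

```php
<?php
// Minimal sketch of analysis + inverted-index creation.
// analyse(), buildInvertedIndex() and the stop-word list are hypothetical
// names for illustration; they are not Lucene or Zend_Search_Lucene APIs.

/** Split text into lower-cased tokens with start/end offsets, dropping stop words. */
function analyse(string $text, array $stopWords): array
{
    $tokens = [];
    preg_match_all('/[A-Za-z]+/', $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$word, $start]) {
        $term = strtolower($word);
        if (in_array($term, $stopWords, true)) {
            continue;
        }
        $tokens[] = ['term' => $term, 'start' => $start, 'end' => $start + strlen($word)];
    }
    return $tokens;
}

/** Map each term to the sorted list of document names that contain it. */
function buildInvertedIndex(array $documents, array $stopWords): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        foreach (analyse($text, $stopWords) as $token) {
            $index[$token['term']][$name] = true;   // one entry per document
        }
    }
    foreach ($index as $term => $docs) {
        $names = array_keys($docs);
        sort($names);
        $index[$term] = $names;
    }
    ksort($index);
    return $index;
}

// Made-up sample documents, loosely following the slide.
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend on a column of fire',
];
$stopWords = ['a', 'for', 'of', 'oh', 'on', 'that', 'the', 'would'];

$index = buildInvertedIndex($documents, $stopWords);
print_r($index['fire']);     // both documents
print_r($index['ascend']);   // both documents
```

A search for a term is then a single array lookup, which is why the work is done at indexing time rather than at query time.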
7. Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
● Conclusions
8. Indexing speed
Benchmark:
●17.4 MB, 814 files of PHP source code
●Linux/Thinkpad T60
             Time to index /seconds   Time to optimise /seconds   Total time /seconds
PHP          167                      43                          210
Java         32                       3                           35
Java + JIT   4                        0.3                         4.3
Ouch! Nearly 50 times as fast in Java with JIT (210 s vs 4.3 s total)
9. Why is the performance so bad?
First make sure we are comparing the same thing:
➢ Analyser
   – Java Lucene has many analysers
➢ Limits on terms
   – Java stops looking at 10,000 terms
➢ Scoring
   – Java rounds down, PHP rounds to closest
➢ Compare indexes using Luke
13. Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
– Part one
– Part two
● Conclusions
14. Execution profiles
● Now that we are definitely comparing the same thing, look at
execution profiles for Java and PHP implementations
● Profiling tools (all open source)
– Java
● Eclipse TPTP
– PHP
● Xdebug
● KCachegrind
– System
● Sysprof
● vmstat, iostat
16. Small problems with TPTP...
Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB
                 Time to index /seconds   Time to optimise /seconds   % time in indexing
Java + profile   687258                   673851                      50
Java             2.3                      0.3                         88
● Invasive and slow. Takes 600,000 times as long to execute
● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
● Output file is machine-readable only
But it's free, open source, and it works well enough.
22. System profile
1. Convert to lower case
2. Look up opcodes
23. How Xdebug works
Script execution:
● ZEND_INIT_METHOD_CALL
   – Convert function name to lower case
   – Look up function in function table
● DO_FCALL_BY_NAME
   – Call out to profiler – start time
   – Execute function
   – Call out to profiler – end time
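The steps on this slide can be mimicked in user-land PHP. The sketch below is only an analogy: `$functionTable`, `callByName()` and `$profile` are hypothetical names invented here, while the real lookup happens inside the Zend engine and the real timing inside the Xdebug extension. It relies on the fact that PHP function names are case-insensitive, which is why the name is lower-cased before the lookup.

```php
<?php
// User-land analogy of the call sequence on the slide.
// $functionTable, callByName() and $profile are hypothetical; the real
// work is done by the Zend engine and the Xdebug extension.

$functionTable = [
    'strrev'     => 'strrev',
    'strtoupper' => 'strtoupper',
];
$profile = [];

function callByName(string $name, array $args, array $functionTable, array &$profile)
{
    // Step 1: function names are case-insensitive, so normalise the name.
    $key = strtolower($name);

    // Step 2: look the function up in the function table.
    if (!isset($functionTable[$key])) {
        throw new BadFunctionCallException("Call to undefined function $name()");
    }
    $callable = $functionTable[$key];

    $start  = hrtime(true);            // call out to profiler: start time
    $result = $callable(...$args);     // execute function
    $end    = hrtime(true);            // call out to profiler: end time

    $profile[$key][] = $end - $start;  // elapsed nanoseconds for this call
    return $result;
}

echo callByName('StrToUpper', ['fire'], $functionTable, $profile), PHP_EOL;  // FIRE
echo callByName('STRREV', ['fire'], $functionTable, $profile), PHP_EOL;      // erif
```

Every call pays for the lower-casing and the table lookup before any useful work starts, and the two profiler call-outs add further cost when profiling is switched on.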
24. The normalize() function
Sum( ) = 2.92
18.99 – 2.92 = 16.07 is consumed in setting up functions to be run
25. Why is function calling faster in Java?
● Java is a static language. VM structures are known at start-up: code can't be added on the fly, and types are known at compile time.
● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
● In PHP, code can be added during execution (for example, with create_function()), and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
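A small sketch of what "code can be added during execution" means in practice. The function name `shout()` is made up for this example, and `eval()` is used in place of create_function(), which was removed in PHP 8.0. The point is that the callee does not exist when the script starts and is selected by a name computed at run time, so the engine cannot fix a dispatch table in advance.

```php
<?php
// shout() is a hypothetical function used only for illustration.
var_dump(function_exists('shout'));   // bool(false): not defined yet

// Define a new function while the script is already running.
// (create_function() was removed in PHP 8.0, so eval() is used here.)
eval('function shout(string $s): string { return strtoupper($s) . "!"; }');

var_dump(function_exists('shout'));   // bool(true): added on the fly

// The callee is chosen by a name that is only known at run time.
$name = 'sh' . 'out';
echo $name('fire'), PHP_EOL;          // FIRE!
```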
26. Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
– Part one
– Part two
● Conclusions
28. Look at the call to normalize()
$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));
// Original version: builds a second Token and copies every field across
public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}
29. Look at the call to normalize()
$token = $this->normalize(
    new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));
normalize() recoded....
// Recoded version: lower-cases the term text in place and returns the same Token
public function normalize(Token $srcToken)
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}
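To see the difference between the two versions in isolation, here is a self-contained sketch. The `Token` class below is a hypothetical stand-in for Zend_Search_Lucene_Analysis_Token, reduced to the accessors that normalize() uses, and `normalizeCopy()` / `normalizeInPlace()` are names chosen here for the two variants. Timings will vary by machine and PHP version, so none are quoted.

```php
<?php
// Hypothetical stand-in for Zend_Search_Lucene_Analysis_Token, reduced to
// the accessors used by normalize(). Not the real Zend Framework class.
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $text, int $start, int $end)
    {
        $this->termText    = $text;
        $this->startOffset = $start;
        $this->endOffset   = $end;
    }
    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $n): void { $this->positionIncrement = $n; }
}

// Original shape: one constructor call plus five accessor calls per token.
function normalizeCopy(Token $src): Token
{
    $new = new Token(strtolower($src->getTermText()), $src->getStartOffset(), $src->getEndOffset());
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

// Recoded shape: two accessor calls and no new object.
function normalizeInPlace(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

$a = normalizeCopy(new Token('Fire', 14, 18));
$b = normalizeInPlace(new Token('Fire', 14, 18));
var_dump($a == $b);   // bool(true): same term text, offsets and position increment

// Rough timing of both variants over the same number of tokens.
foreach (['normalizeCopy', 'normalizeInPlace'] as $fn) {
    $start = hrtime(true);
    for ($i = 0; $i < 100000; $i++) {
        $fn(new Token('Fire', 14, 18));
    }
    printf("%-17s %.1f ms\n", $fn, (hrtime(true) - $start) / 1e6);
}
```

The in-place version mutates its argument, which is safe at this call site because the token passed in is freshly constructed and not referenced anywhere else.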
31. Performance improvement?
             Time to index /seconds   Time to optimise /seconds   Total time /seconds
PHP          167                      43                          210
PHP + fix    151                      43                          194
Java         32                       3                           35
Java + JIT   4                        0.3                         4.3
9.5 % improvement in time to index (167 s to 151 s); about 7.6 % in total time (210 s to 194 s)
32. Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
– Part one
– Part two
● Conclusions
33. Conclusions
● Two reasons why the PHP implementation of Lucene is slow:
– Function calling overhead in PHP
– Inefficient code in the analyser [3]
– These are the main two; there are others....
● Dynamic and fast?
– Hard to get to the same execution speed as Java, but possible to get closer.
– But development speed is much better [4]. What speed do you care about?
– Better not to use Java coding style (lots of methods that do nothing)
● So which implementation of Lucene should I use?
– It depends.....
3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
34. Options for PHP
● Do you care about speed?
– Y: Can support Java environment?
– N: Only need basic features?
● Can support Java environment?
– Y: Use a Web Service?
– N: No Lucene solution today [5]
● Use a Web Service?
– Y: Use SOLR as web service
– N: Use Lucene via a Java bridge
● Only need basic features?
– Y: Use Zend Search Lucene
5. http://pecl.php.net/package/clucene
35. Acknowledgements
● Rob Young's presentation [6] to the London PHP user group.
● Members of the PHP internals community, in particular Scott MacVicar,
Derick Rethans and Dmitry Stogov.
6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)