advertisement

Search Lucene

50 %
50 %
advertisement
Information about Search Lucene
Technology

Published on February 20, 2009

Author: phpcodemonkey

Source: slideshare.net

Description

Zoe Slattery's slides from PHPNW08:

The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. From the point of view of the application developer there are a number of choices to make, the first is whether to use a complete packaged solution or whether to use one of the available information libraries to build a custom information retrieval (IR) solution. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications.

For Java programmers there is clearly a good range of options for text retrieval libraries, but options for PHP programmers are more limited. At first sight for a PHP programmer wishing to embed indexing and search facilities in their application, the choice seems obvious - the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP therefore easy for PHP programmers to work with and the license is commercially friendly. However, whilst ease of integration and support are key factors in choice of technology, performance can also be important; the performance of the PHP implementation of Lucene is poor compared to the Java implementation.

In this talk I’ll explain the differences in performance between PHP implementation of Lucene and the Java implementation and examine the other options available to PHP programmers for whom performance is a critical factor.
advertisement

Can you be dynamic and fast? “ Miss Marple and the case of the Missing MIPS” Zoë Slattery

Agenda Index and search applications The problem for PHP programmers Understanding execution times Conclusions

Index and search applications

The problem for PHP programmers

Understanding execution times

Conclusions

Index and search Problem of finding relevant information is not new. 3000 years BC [1] Vannevar Bush, As We May Think, 1945. Today applications that search the Web must be able to provide instant access to > 10 billion documents Many applications need some form of search, eg searching your hard drive, email.... 1. Lagoze, C. Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.

Problem of finding relevant information is not new.

3000 years BC [1]

Vannevar Bush, As We May Think, 1945.

Today applications that search the Web must be able to provide instant access to > 10 billion documents

Many applications need some form of search, eg searching your hard drive, email....

Options for information retrieval Search engines Nutch, SearchBlox..... Information Retrieval libraries Three with broadly similar features Egothor Xapian Lucene Implementation language Language bindings Language ports License Java None None BSD like C++ Perl, Python, PHP, Java, TCL None GPL Java None C++, Perl, PHP, C# Apache 2

Search engines

Nutch, SearchBlox.....

Information Retrieval libraries

Three with broadly similar features

Lucene [2] 2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005. DB Web File system Get user query Present search results Index Index documents Search index Gather data Lucene Application User

Lucene indexing start 3. Inverted index 1. Documents Analysis Index creation Optimise 4. Optimised inverted index . Oh for a muse of fire that would acsend the brightest heaven of invention..... fire ascend ... Henry V, Scouting for boys... Aerospace, Henry V... Terms Documents end [fire] [ascend] [bright] [heaven] 2. Token stream

Agenda Index and search applications The problem for PHP programmers Understanding execution times Conclusions

Index and search applications

The problem for PHP programmers

Understanding execution times

Conclusions

Indexing speed Benchmark: 17.4 MB, 814 files of PHP source code Linux/Thinkpad T60 Java + JIT Java PHP 4 32 167 Time to index /seconds 0.3 3 43 Time to optimise /seconds 4.3 35 210 Total time Ouch! nearly 50 times as fast in Java

Benchmark:

17.4 MB, 814 files of PHP source code

Linux/Thinkpad T60

Why is the performance so bad? First make sure we are comparing same thing: Compare indexes using Luke Limits on terms Java stops looking at 10,000 terms Scoring Java rounds down, PHP rounds to closest Analyser Java Lucene has many analysers

First make sure we are comparing same thing:

Compare indexes using Luke

Limits on terms

Java stops looking at 10,000 terms

Scoring

Java rounds down, PHP rounds to closest

Analyser

Java Lucene has many analysers

Analysis - Java Analyzing "A Quick Brown Fox jumped over the Lazy Dog" StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Analyzing "XY&Z Corporation - xyz@example.com" StandardAnalyzer: [xy&z] [corporation] [xyz@example.com] SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com] StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

Analysis - PHP Analysing "A Quick Brown Fox jumped over the Lazy Dog" Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog] Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog] Analysing "XY&Z Corporation - xyz@example.com" Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com] Stop words filter: [xy] [z] [corporation] [xyz] [example] [com] Short words filter: [xy] [corporation] [xyz] [example] [com]

Compare indexes Same 663 terms java php

Agenda Index and search applications The problem for PHP programmers Understanding execution times Part one Part two Conclusions

Index and search applications

The problem for PHP programmers

Understanding execution times

Part one

Part two

Conclusions

Execution profiles Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations Profiling tools (all open source) Java Eclipse TPTP PHP Xdebug KCachegrind System Sysprof vmstat, iostat

Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations

Profiling tools (all open source)

Java

Eclipse TPTP

PHP

Xdebug

KCachegrind

System

Sysprof

vmstat, iostat

Java profile

Small problems with TPTP... Invasive and slow. Takes 600,000 times as long to execute Some problems getting to run on Ubuntu (missing C++ libraries, ksh specific scripts) Output file is machine readable only But – it's free, open source and it works enough. Benchmark data: 39 files of PHP source code (php/Zend), 1.2 MB Java Java + profile 2.3 687258 Time to index /seconds 0.3 673851 Time to optimise /seconds 88 50 % time in indexing

Invasive and slow. Takes 600,000 times as long to execute

Some problems getting to run on Ubuntu (missing C++ libraries, ksh specific scripts)

Output file is machine readable only

But – it's free, open source and it works enough.

Benchmark data:

39 files of PHP source code (php/Zend), 1.2 MB

PHP profile

No problems with this tool Not so invasive as the Java tool but still adds to time and distorts slightly Results easy to display with KCachegrind Output file is readable Benchmark data: 39 files of PHP source code (php/Zend), 1.2 MB PHP PHP + profile 5 70 Time to index /seconds 3 55 Time to optimise /seconds 63 56 % time in indexing

Not so invasive as the Java tool but still adds to time and distorts slightly

Results easy to display with KCachegrind

Output file is readable

Benchmark data:

39 files of PHP source code (php/Zend), 1.2 MB

look at the normalize() code public function normalize(Token $srcToken ) { $newToken = new Token( strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }

The normalize() function Sum( ) = 2.92; 18.99 – 2.92 = 16.07

Micro benchmark <?php require_once &quot;Token.php&quot;; require_once &quot;LowerCase.php&quot;; $token = new Token(&quot;GO&quot;, 105, 107); $filter = new LowerCase(); for ($i=0; $i < 10000000; $i++) { $norm_token = $filter->normalize($token); } ?>

normalize() opcodes compiled vars: !0 = $srcToken, !1 = $newToken line # op ext return operands ---------------------------------------------------------------------------- 11 0 RECV 1 13 1 ZEND_FETCH_CLASS :0 'Token' 2 NEW $1 :0 3 ZEND_INIT_METHOD_CALL !0, 'getTermText' 4 DO_FCALL_BY_NAME 0 5 SEND_VAR_NO_REF $3 6 DO_FCALL 1 'strtolower' 7 SEND_VAR_NO_REF $4 14 8 ZEND_INIT_METHOD_CALL !0, 'getStartOffset' 9 DO_FCALL_BY_NAME 0 10 SEND_VAR_NO_REF $6 15 11 ZEND_INIT_METHOD_CALL !0, 'getEndOffset' 12 DO_FCALL_BY_NAME 0 13 SEND_VAR_NO_REF $8 14 DO_FCALL_BY_NAME 3 15 ASSIGN !1, $1 16 ......

System profile 1. Convert to lower case 2. Look up opcodes

How Xdebug works Script execution Convert function name to lower case Look up function in function table Execute function Call out to profiler – start time Call out to profiler – end time ZEND_INIT_METHOD_CALL DO_FCALL_BY_NAME

Convert function name to lower case

Look up function in function table

The normalize() function Sum( ) = 2.92; 18.99 – 2.92 = 16.07 Is consumed in setting up functions to be run

Why is function calling faster in Java? Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time. First time a function is called Java caches a reference to it in a virtual dispatch table. After that function calls are fast. In PHP, code can be added during execution, for example, create_function() and types are not known till code is executed. This makes keeping virtual dispatch tables much more difficult.

Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time.

First time a function is called Java caches a reference to it in a virtual dispatch table. After that function calls are fast.

In PHP, code can be added during execution, for example, create_function() and types are not known till code is executed. This makes keeping virtual dispatch tables much more difficult.

Agenda Index and search applications The problem for PHP programmers Understanding execution times Part one Part two Conclusions

Index and search applications

The problem for PHP programmers

Understanding execution times

Part one

Part two

Conclusions

PHP profile

look at the call to normalize() $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize(Token $srcToken ) { $newToken = new Token(strtolower( $srcToken->getTermText() ), $srcToken->getStartOffset(), $srcToken->getEndOffset()); $newToken->setPositionIncrement($srcToken->getPositionIncrement()); return $newToken; }

look at the call to normalize() normalize() recoded.... $token = $this->normalize( new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos)); public function normalize (Token $srcToken) { $ srcToken->setTermText(strtolower($srcToken->getTermtext())); return $srcToken; }

After fix

Performance improvement? PHP + fix PHP 151 167 Time to index /seconds 43 43 Time to optimise /seconds Java 32 3 35 194 210 Total time 9.5 % improvement Java + JIT 4 0.3 4.3

Agenda Index and search applications The problem for PHP programmers Understanding execution times Part one Part two Conclusions

Index and search applications

The problem for PHP programmers

Understanding execution times

Part one

Part two

Conclusions

Conclusions Two reasons why the PHP implementation of Lucene is slow: Function calling overhead in PHP Inefficient code in the analyser [3] These are the main two, there are others.... Dynamic and fast? Hard to get to the same execution speed as Java – but possible to get closer. But development speed is much better [4]– what speed to you care about? Better not to use Java coding style (lots of methods that do nothing) So which implementation of Lucene should I use? it depends..... 3. http://framework.zend.com/issues/browse/ZF-3683 4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.

Two reasons why the PHP implementation of Lucene is slow:

Function calling overhead in PHP

Inefficient code in the analyser [3]

These are the main two, there are others....

Dynamic and fast?

Hard to get to the same execution speed as Java – but possible to get closer.

But development speed is much better [4]– what speed to you care about?

Better not to use Java coding style (lots of methods that do nothing)

So which implementation of Lucene should I use?

it depends.....

Options for PHP Y Y Y N N N N Y 5. http://pecl.php.net/package/clucene Do you care about speed? Use Zend Search Lucene Only need basic features? Can support Java environment? Use a Web Service? Use Lucene via a Java bridge No Lucene solution today [5] Use SOLR as web service

Other useful links http://www.egothor.org/ http://xapian.org/ http://lucene.apache.org/ http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html http://www.derickrethans.nl/vld.php http://lucene.apache.org/nutch/ http://www.searchblox.com/ http://www.xdebug.org/ http://www.eclipse.org/tptp/ http://www.getopt.org/luke/ http://www.projectzero.org http://www.ibm.com/developerworks/ (Publication due 24/09/08) http://php-java-bridge.sourceforge.net/doc/ http://www.zend.com/en/products/platform/product-comparison/java-bridge http://lucene.apache.org/solr/ http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html

http://www.egothor.org/

http://xapian.org/

http://lucene.apache.org/

http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

http://www.derickrethans.nl/vld.php

http://lucene.apache.org/nutch/

http://www.searchblox.com/

http://www.xdebug.org/

http://www.eclipse.org/tptp/

http://www.getopt.org/luke/

http://www.projectzero.org

http://www.ibm.com/developerworks/ (Publication due 24/09/08)

http://php-java-bridge.sourceforge.net/doc/

http://www.zend.com/en/products/platform/product-comparison/java-bridge

http://lucene.apache.org/solr/

http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html

Add a comment

Related presentations

Presentación que realice en el Evento Nacional de Gobierno Abierto, realizado los ...

In this presentation we will describe our experience developing with a highly dyna...

Presentation to the LITA Forum 7th November 2014 Albuquerque, NM

Un recorrido por los cambios que nos generará el wearabletech en el futuro

Um paralelo entre as novidades & mercado em Wearable Computing e Tecnologias Assis...

Microsoft finally joins the smartwatch and fitness tracker game by introducing the...

Related pages

Apache Lucene - Welcome to Apache Lucene

The Apache Lucene TM project develops open-source search software, including: Lucene Core, our flagship sub-project, provides Java-based indexing and ...
Read more

Apache Lucene - Apache Lucene Core

Apache Lucene TM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any ...
Read more

Apache Lucene.Net

Congratulations to the Apache Lucene.Net Developers and Community! We've made it out of incubation. Seems like a relatively quick road; About a year an a ...
Read more

GitHub - owncloud/search_lucene: The ownCloud search ...

README.md Search Lucene. The Search Lucene app adds a full text search for files stored in ownCloud. It is based on Zend Search Lucene and can index ...
Read more

Apache Lucene – Wikipedia

Apache Lucene ist eine Programmbibliothek zur Volltextsuche und ein Projekt der Apache ... Der Name war eine Abkürzung für Search on Lucene and Resin.
Read more

Lucene - Wikipedia, the free encyclopedia

Apache Lucene is a free and open-source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache ...
Read more

Search Lucene

To search Lucene and all its subprojects, enter your query ant hit the "Search" button.
Read more

Apache Lucene.Net

Apache Lucene.Net 3.0.3 - Oct 10, 2012. Use the links below to download the Lucene.Net from one of our mirrors. Mirror.
Read more

Lucene Tutorial - University of California, Los Angeles

Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucene to provide full-text indexing across both database ...
Read more

LuceneFAQ - Lucene-java Wiki

Lucene-java Wiki. Login. LuceneFAQ. FrontPageEN; RecentChanges; FindPage; ... Zend Search Lucene is not at all related to the Apache Lucene ...
Read more