Starting with Zend_Search_Lucene

Posted March 11th, 2009 by Juozas

As websites grow, searches like “title LIKE ‘%search term%’” become unreliable. There are very good solutions like Sphinx and Lucene, but, not surprisingly, you can’t always have Sphinx installed (shared servers again), so another solution has to be chosen.

MySQL supports full-text indexing, but it doesn’t give you much control over the actual index. Luckily, the Zend team has done a wonderful job and implemented Lucene search entirely in PHP. Zend_Search_Lucene is part of Zend Framework but, like all framework modules, it runs almost independently (it only uses a few shared classes such as Zend_Exception).

How do you start indexing data? The Zend manual has very good examples of how to start with Lucene, but to create a sample index you can use this code (you need to have auto-loading enabled and a db connection available):

// Create index
$index = Zend_Search_Lucene::create('indexes/products');

$sql = "SELECT product_name, product_url FROM products";

// Fetch rows as objects so that $result->product_url works
// (Zend_Db returns associative arrays by default)
$db->setFetchMode(Zend_Db::FETCH_OBJ);
$results = $db->fetchAll($sql);

foreach ($results as $result)
{
    $doc = new Zend_Search_Lucene_Document();

    // Store document URL to identify it in the search results
    $doc->addField(
        Zend_Search_Lucene_Field::UnIndexed('url', $result->product_url));

    // Index document title
    $doc->addField(
        Zend_Search_Lucene_Field::Text('title', $result->product_name));

    // Add document to the index
    $index->addDocument($doc);
}

// Optimize index.
$index->optimize();

This simple code selects product information from the database, loops through the results and adds them to the index as documents. In this example I added the url as UnIndexed because I’m only going to search by title, but Lucene allows other field types. In most cases, the product description or document text should be added as well (and probably indexed).
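As a rough sketch of the other field types, the helper below builds a document that also keeps a SKU as an exact-match keyword and indexes a description without storing it. The function name buildProductDocument and the product_sku and product_description columns are assumptions made for illustration; they are not part of the example above. Rows are assumed to be fetched as objects.

```php
<?php
// Hypothetical helper: the function name and the product_sku /
// product_description columns are assumptions for illustration.
function buildProductDocument($row)
{
    $doc = new Zend_Search_Lucene_Document();

    // UnIndexed: stored with the document, but not searchable
    $doc->addField(
        Zend_Search_Lucene_Field::UnIndexed('url', $row->product_url));

    // Keyword: stored and indexed as one exact term (not tokenized)
    $doc->addField(
        Zend_Search_Lucene_Field::Keyword('sku', $row->product_sku));

    // Text: stored, tokenized and indexed
    $doc->addField(
        Zend_Search_Lucene_Field::Text('title', $row->product_name));

    // UnStored: tokenized and indexed, but not returned with search hits
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('description', $row->product_description));

    return $doc;
}
```

Inside the indexing loop this would be called as $index->addDocument(buildProductDocument($result));. UnStored keeps the index smaller for long descriptions, at the cost of not being able to print the description straight from a search hit.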

Searching the index is even easier. The one thing you need to learn is how to construct search queries in the Lucene query language. Example:

// Open index
$index = Zend_Search_Lucene::open('indexes/products');
 
$query = 'title:"Apple MacBook"';
 
// Search by query
$hits = $index->find($query);
 
foreach ($hits as $hit) {
    echo $hit->score . " ";
    echo $hit->title . " ";
    echo $hit->url . PHP_EOL;
}

I tried creating an index of 6,000 products: the index (0.7 MB) was created in around 3 minutes, and each search takes about 0.1 s. I tested it on my laptop, without APC and with a development Apache/PHP configuration. Production servers would run this task much faster, but 0.1 s per search is not that bad.

Zend_Search_Lucene will not replace Sphinx or Lucene, but in limited environments (like shared servers) it can be quite useful. It supports many query types: phrase queries, boolean queries, wildcard queries, proximity queries, range queries and many others, which are hard to achieve with MySQL full-text indexes.
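To give a feeling for these query types, here is a small sketch with one query string per type, written against the title field from the products example. The search terms and the countHits helper are made up for illustration; only $index->find() comes from the code above.

```php
<?php
// Illustrative query strings for the "title" field of the products index.
// The search terms are made up for demonstration.
$queries = array(
    'phrase'    => 'title:"Apple MacBook"',          // exact phrase
    'boolean'   => 'title:apple AND NOT title:ipod', // boolean operators
    'wildcard'  => 'title:mac*',                     // prefix wildcard
    'proximity' => 'title:"apple macbook"~5',        // words within 5 positions
    'range'     => 'title:[apple TO macbook]',       // inclusive term range
);

// Hypothetical helper: runs every query against an opened index and
// returns the number of hits per query type.
function countHits($index, array $queries)
{
    $counts = array();
    foreach ($queries as $type => $query) {
        $counts[$type] = count($index->find($query));
    }
    return $counts;
}
```

With the index from above this could be used as print_r(countHits(Zend_Search_Lucene::open('indexes/products'), $queries));.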

Comments (9)

  1. robo47

    But currently it is important to filter the data you pass to the query method, because some inputs can easily make the script hit memory_limit or max_execution_time.

    Especially queries like * AND * AND * … AND * will use LOTS of memory and can run for minutes or longer. The proximity search is also a risk for long-running queries when it is given two equal words and a big number.

    Thread in a German forum: http://www.zfforum.de/showthread.php?p=29697
    A related open bug in the tracker: http://framework.zend.com/issues/browse/ZF-3321

  2. Juozas (author)

    Oh, I see.
    I have read some complaints about memory/speed issues. Now I need to add “query injections” to the list of things that can cause problems.

    What about letting users submit only keywords, not actual queries, and then building the queries from them? It should work, but you limit yourself a lot :(

  3. robo47

    Without the ability to let the user write complete queries, Lucene loses a lot. All the nice power the Lucene implementation gives is gone.
    I currently only filter out * and ~, so most of what the query language provides is still usable.
    After finding out about the * AND * … AND * problem myself, I ran a lot of test queries against the Zend implementation, and the only dangerous queries I was able to create contained * or ~.
    It’s only some basic blacklisting, but better than nothing. Additionally, I have implemented a search query log which includes the execution time and memory usage of each search, so if anything bad happens I can analyze the queries and probably find a way to filter them.
    It would be nice to find a solution for this, for example a way to give the query a time limit which is checked during the search and throws an exception once the time is exceeded. If the script dies because of memory_limit or max_execution_time, there are only 2 ways this is shown: a white page or, if display_errors is on, an error shown to the user … neither is something I want to choose.

  4. selected

    It is fine as long as you use it for something small (like 6,000 products or a blog like this). In my case (4 billion A4 documents) I would have to hang myself with the Zend implementation of Lucene.

  5. Juozas (author)

    4 billion documents will clearly kill Zend_Search_Lucene :)) But all the normal solutions (Lucene, Sphinx) should work fine, I guess.

  6. robo47

    Seems something got fixed in the ZF 1.7.7 release:

    http://framework.zend.com/issues/browse/ZF-3321
    http://framework.zend.com/code/changelog/Zend_Framework/?cs=14304

    I didn’t look at what exactly was changed, but the ticket got marked as fixed.

  7. Juozas (author)

    The patch adds limits for the maximum number of terms per query and for the prefix length. I will probably test it with the new version next week, but from the source it looks like the “* AND * AND *” issue should be fixed.

  8. willian

    Hi,

    I’m trying to delete a doc that I added. I have 20 documents added. Lucene says that the doc was deleted, but when I search I get a result like this: “contains 20 documents”, when it should be 19. Do you have any idea why this is happening?

    Thanks,
    willian

  9. Patrick

    What is the limit of records for the Zend implementation of Lucene?
    A while ago I inserted 10,000 records coming from a MySQL database (key/value combos), with value being a TEXT data type in MySQL.
    When I did a $index->find("value:'test'"); I got too many records back, even resulting in scores of 4.08383+39E (or something like that) ???
    If I keep my records below 1000 everything seems to work fine… (Debian Etch GNU/Linux, PHP v5.2, Zend Framework v1.7.5)

    Regards,

    Patrick
