How to Index and Search using Sphinx on Ubuntu
OS & Software Version Limitations #
The author has used Sphinx 2.0.8 on Ubuntu 13.10 64-bit systems.
- For systems with different packaging frameworks, such as RPM-based distributions, you will need to find packages corresponding to the ones mentioned here.
- An older or newer Ubuntu system may have a different package dependency tree, so you may see errors about a particular package not being found. In most cases, the error message should point you to an alternative.
- With an older or newer major version of Sphinx, the usage of its components may be different than what is used here.
Install Sphinx #
Install required dependency packages as follows:
$ sudo apt-get install gcc g++ make
$ sudo apt-get install libxml2-dev libreadline6-dev libexpat1-dev libexpat1
Download Sphinx 2.0.8 source from this page and extract into a folder in your home directory.
$ tar xvfz sphinx-2.0.8-release.tar.gz
$ cd sphinx-2.0.8-release
Feel free to get the newest release from here. It was at 2.1.9 at the time of writing this article, and the steps listed here may work with that version. If it is at a 3.x or higher major version, you may get errors.
Configure, build and install Sphinx as follows.
$ ./configure --without-mysql
$ make
$ sudo make install
Please note that the author used the without-MySQL configuration. In case you need MySQL mode queries, remove the --without-mysql parameter from the configure command.
The Sphinx binaries get installed in /usr/local/bin/.
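To confirm the installation, you can check that the main Sphinx binaries are in place (a quick sanity check; the exact set of installed binaries may vary slightly across versions):
$ ls -l /usr/local/bin/indexer /usr/local/bin/searchd /usr/local/bin/search
$ /usr/local/bin/indexer
Running indexer with no arguments prints the Sphinx version and usage, which also confirms the build works.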
Indexing your Data #
Sphinx has a command-line program called indexer at /usr/local/bin/indexer. To index your documents, you would hypothetically do something like this: $ cat all-your-docs | indexer. However, it’s a little less straightforward than that.
Structure #
Let’s follow a directory organization so that scripts and configuration files used in this article can be easily used by copying and making minimal changes.
Let’s say you are creating an index on a set of news articles. Do the following:
$ sudo mkdir -p /search/data/news /search/index/news /search/conf /search/bin/news
- /search/data/news - This directory would store all your articles
- /search/index/news - This directory would contain all files of your index once created
- /search/conf - This directory would hold your Sphinx configuration files used through the indexing and searching phases
- /search/bin/news - This directory would hold the scripts we use for the indexing phase
Indexing Configuration - Part 1 #
Create the file /search/conf/sphinx.conf. Add the following configuration text to it:
source news
{
type = xmlpipe
xmlpipe_command = /search/bin/news/xmlout.sh
xmlpipe_field = content
xmlpipe_attr_string = url
xmlpipe_attr_uint = date
xmlpipe_fixup_utf8 = 1
}
Let’s understand the above config section.
source news
This means that this section defines an indexing source, and it is identified by the name news. We’ll soon see how this ‘name’ is used.
type = xmlpipe
This tells Sphinx that this source is in XML format and is piped to the indexer application.
In this article, we’ll see how to use an XML format source for indexing. There are other ways, but we’ll stick to the XML format here.
xmlpipe_command = /search/bin/news/xmlout.sh
This is the command that is run to output the XML format source for indexing. We will write the script /search/bin/news/xmlout.sh later in this article.
This script generates an XML stream consisting of all the documents that need to be indexed. All documents are indexed in one go.
Fields and Attributes
At this point, let’s understand a bit about fields and attributes - the concepts related to indexing and searching.
The content of a document that needs to be indexed is called a field. Intuitively, for a news article, you may think of the whole content as one field.
An attribute is information associated with a document.
When one searches by providing a set of keywords, it is matched against the index formed using the content in fields of all documents. For all documents matched, what is returned is the set of attributes of these documents.
xmlpipe_field = content
xmlpipe_attr_string = url
xmlpipe_attr_uint = date
This tells Sphinx indexer that for each document:
- The text under the XML tag content should be treated as a field (so it’s indexed).
- The text under the XML tag url should be treated as a string attribute.
- The text under the XML tag date should be treated as a numerical attribute.
In this example, we want the URL and date of a news article to be returned for matched documents for certain application logic. Think about what attributes would be useful for your application and replace this example’s attributes with yours.
Numerical Date?
A numerical date attribute sounds a bit counter-intuitive. Here is an explanation and a bit of experience. While searching, one can specify a range of values for a given attribute. For example, we could search for documents matching a given set of keywords but also falling within a given range of dates. Sphinx does not support ranges over string attributes; an attribute needs to be numeric if you want to filter on a range of its values.
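As a quick, hypothetical illustration of why this works: an ISO date with the dashes stripped becomes an integer that preserves chronological order, so a numeric range comparison behaves exactly like a date comparison.
$ d1=`echo "2013-12-31" | sed 's/\-//g'`    # 20131231
$ d2=`echo "2014-08-06" | sed 's/\-//g'`    # 20140806
$ [ $d2 -gt $d1 ] && echo "d2 is the later date"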
xmlpipe_fixup_utf8 = 1
This tells the Sphinx indexer to deal with UTF-8 text in the XML stream, and not bail out with errors.
XML Source #
Let’s say you have 2 news articles with the following contents.
There has been a major hike in the auto-rickshaw fares today in Bangalore. This is third such price hike in less than 2 years.
And,
High-profile ministers of the government had a meeting with Anna Hazare to understand his demands. Team Anna has welcomed this. At the same time, a number of Team Anna leaders do not see much point in this.
The Sphinx indexer then would look for an XML stream like the following:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:document id="1">
<url>url-here</url>
<date>date-here</date>
<content>There has been a major hike in the auto-rickshaw fares today in Bangalore. This is third such price hike in less than 2 years.</content>
</sphinx:document>
<sphinx:document id="2">
<url>url-here</url>
<date>date-here</date>
<content>High-profile ministers of the government had a meeting with Anna Hazare to understand his demands. Team Anna has welcomed this. At the same time, a number of Team Anna leaders do not see much point in this.</content>
</sphinx:document>
</sphinx:docset>
Things to note in the above are:
- Each document being indexed needs to have a numeric id.
- Note how each document has a tag for every field and attribute mentioned in the configuration file.
Write the /search/bin/news/xmlout.sh file to generate the above XML from your documents. Here is a sample one:
#!/bin/bash
# Emit a Sphinx xmlpipe docset built from all files under /search/data/news/.
# Each file stores the date on line 1, the URL on line 2, and the article body after that.
echo "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
echo "<sphinx:docset>"
counter=0
for filename in `find /search/data/news/ -type f`
do
    # Line 2 holds the URL; encode '&' so the XML stays well-formed.
    url=`head -n 2 "$filename" | tail -n 1`
    url=`echo "$url" | sed 's/&/%26/g'`
    # Line 1 holds the ISO date; strip the dashes to get its numeric form.
    date=`head -n 1 "$filename" | sed 's/\-//g'`
    counter=`expr $counter + 1`
    # Everything after the first two lines is the article body.
    lines=`wc -l "$filename" | awk '{print $1}'`
    lines=`expr $lines - 2`
    echo "<sphinx:document id=\"$counter\">"
    echo "<url>$url</url>"
    echo "<date>$date</date>"
    echo -n "<content>"
    # Replace '&' with 'and' so the indexer's XML parser does not choke.
    tail -n $lines "$filename" | sed 's/&/and/g'
    echo "</content>"
    echo "</sphinx:document>"
done
echo "</sphinx:docset>"
Things to note and understand in the above sample script are:
- Note how ‘&’ is replaced with its URL encoding or a logical equivalent (the word and). Dealing with typical XML parsers would have told you how much they dislike the ‘&’ character. That experience continues here: in case you don’t replace ‘&’ with some acceptable alternative, the Sphinx indexer would complain and bail out.
- Extending the above point, there should be no stray tags or ‘<’ and ‘>’ characters inside your fields and attributes either. Such an XML stream would make the indexer bail out with errors as well. Note how clean our example news articles are.
- We kept the date and URL on the 1st and 2nd lines, respectively, of the files that contain the news articles, which makes it easy for this script to generate the required XML format indexing source containing all attributes and fields.
- The date is assumed to be stored in ISO format and is converted to a numeric form (the date attribute is supposed to be numeric, as per the above config).
- We kept all files in /search/data/news/, one file per document to be indexed.
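To try the script end to end, you can create a sample article file in this layout, make the script executable (xmlpipe_command must point to an executable), and preview the generated XML. The file name and URL below are made up for illustration:
$ sudo tee /search/data/news/article-1.txt > /dev/null <<'EOF'
2014-08-06
http://example.com/news/auto-fares
There has been a major hike in the auto-rickshaw fares today in Bangalore. This is third such price hike in less than 2 years.
EOF
$ sudo chmod +x /search/bin/news/xmlout.sh
$ /search/bin/news/xmlout.sh | head -n 10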
Indexing Configuration - Part 2 #
Add the following to /search/conf/sphinx.conf.
index news
{
source = news
path = /search/index/news/idx
docinfo = extern
mlock = 0
morphology = stem_en, soundex
min_word_len = 1
charset_type = utf-8
html_strip = 1
}
This section defines/describes the index news. Basically, it tells Sphinx how to create this index, where it is stored, and so on.
source = news
It tells Sphinx to use the source with the name news (the first config section defined above) to build this index called news.
path = /search/index/news/idx
All files of the news index are stored in the /search/index/news/ folder, and their names begin with ‘idx’.
All other settings here are standard and you can learn about them separately. Here is one that is specifically relevant when web pages are being indexed.
html_strip = 1
It seems self-explanatory. Nevertheless, the indexer would strip all HTML tags from the XML format source feed it receives before indexing.
Finally, a bit of configuration about the indexer runtime:
indexer
{
mem_limit = 128M
}
This sets the maximum memory the indexer may use.
The Sphinx indexer does not always produce very helpful errors. I noticed that while running it on 512MB RAM systems, it would just crib with some vague error, while it worked perfectly fine on 1GB RAM systems for the same amount of data. Not much information is available on what goes on here. I experimented, guessed there might be memory-related issues, and moving to higher memory worked.
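If you hit such vague failures yourself, one first thing to try is raising mem_limit, within what your machine can spare. For example:
indexer
{
mem_limit = 256M
}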
Finally.. index them! #
The following command would do the indexing:
$ sudo /usr/local/bin/indexer --config /search/conf/sphinx.conf --verbose news
There will be a progress message after every 1000 documents. If your files are good, indexing will complete and print some stats about how many documents were indexed, along with more technical details you may not care about.
Check that your index files exist:
$ ls -l /search/index/news/
Troubleshooting #
In case there is an issue with your XML format feed, the indexer will report the document id and the line number of the problem it could not deal with. In our setup, running /search/bin/news/xmlout.sh and looking at the mentioned line number should help you figure out the issue. In most cases seen so far, it’s about malformed XML (a stray ‘&’ character etc.).
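Since the reported line number refers to the generated XML stream (not to any one source file), you can regenerate the stream and inspect the offending region directly. The line numbers below are just placeholders for whatever the indexer reports:
$ /search/bin/news/xmlout.sh | cat -n | sed -n '1230,1240p'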
Searching #
The Sphinx program searchd, found at /usr/local/bin/searchd, is the daemon that serves searches against any indexes.
Search configuration #
The searchd program needs to know about the indices to serve searches against. In our example, the news index is already specified in /search/conf/sphinx.conf, and that section has enough information for searchd, including the location of the index files.
We only need to add a configuration section for the searchd runtime. Add the following to /search/conf/sphinx.conf.
searchd
{
listen = 9312
log = /var/log/searchd.log
query_log = /var/log/query.log
read_timeout = 5
client_timeout = 300
max_children = 30
pid_file = /var/log/searchd.pid
max_matches = 1000
seamless_rotate = 1
preopen_indexes = 1
unlink_old = 1
mva_updates_pool = 1M
max_packet_size = 8M
max_filters = 256
max_filter_values = 4096
max_batch_queries = 32
workers = threads # for RT to work
dist_threads = 4
}
Most of these values are standard/common, and for the purpose of getting your search to work, it’s not necessary to understand their details.
Start the Search Daemon #
Start the search daemon with the following command:
$ sudo /usr/local/bin/searchd -c /search/conf/sphinx.conf
There would be messages showing which indexes were read etc., and searchd should start and be ready to serve searches at this point.
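A few housekeeping commands may help at this point: checking that the daemon is listening on the configured port, stopping it cleanly, and re-running the indexer against a live searchd (the --rotate flag makes the running searchd pick up the rebuilt index, which is what the seamless_rotate setting above enables):
$ netstat -tln | grep 9312
$ sudo /usr/local/bin/searchd -c /search/conf/sphinx.conf --stop
$ sudo /usr/local/bin/indexer --config /search/conf/sphinx.conf --rotate news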
Searching through PHP #
Sphinx has client APIs for multiple languages. Here we show how to use its PHP API towards building a website that performs searches.
Copy the file sphinxapi.php from the api subdirectory of the Sphinx source tree to your htdocs or PHP code folder.
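For example, assuming the source tree extracted earlier and Apache’s default document root on Ubuntu 13.10 (adjust both paths to your setup):
$ sudo cp ~/sphinx-2.0.8-release/api/sphinxapi.php /var/www/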
Here is a sample PHP file that does search using the above PHP API.
<?php
// Serve plain text; the index content is UTF-8 (see charset_type in sphinx.conf).
header('Content-Type: text/plain; charset=utf-8');
include('sphinxapi.php');

$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);     // host and the 'listen' port from the searchd config
$cl->SetSortMode(SPH_SORT_RELEVANCE);  // order matches by relevance

// Pagination: return 'results' matches starting at 'offset'.
$results = (int)$_GET['results'];
$offset = (int)$_GET['offset'];
$cl->SetLimits($offset, $results);

// Restrict matches to a numeric date range (the 'date' attribute defined in sphinx.conf).
$min_date = (int)$_GET['mindate'];
$max_date = (int)$_GET['maxdate'];
$cl->SetFilterRange('date', $min_date, $max_date);

// Pick the match mode requested by the caller.
if ($_GET['mode'] == 'All') {
    $cl->SetMatchMode(SPH_MATCH_ALL);
}
else if ($_GET['mode'] == 'Any') {
    $cl->SetMatchMode(SPH_MATCH_ANY);
}
else if ($_GET['mode'] == 'Phrase') {
    $cl->SetMatchMode(SPH_MATCH_PHRASE);
}
else if ($_GET['mode'] == 'Extended') {
    $cl->SetMatchMode(SPH_MATCH_EXTENDED);
}

// Escape double quotes in the user-supplied keywords before querying.
$keywords = preg_replace('/"/', '\\"', $_GET['keywords']);
$result = $cl->Query($keywords, $_GET['index']);

if ($result === false) {
    echo "ERROR|Query failed: " . $cl->GetLastError() . "\n";
}
else {
    if ($cl->GetLastWarning()) {
        echo "WARNING|" . $cl->GetLastWarning() . "\n";
    }
    if (!empty($result['matches'])) {
        print_r($result['matches']);
    }
}
?>
Play with search queries now. For example:
$ curl "http://localhost/search.php?keywords=Anna+Team&results=20&offset=0&mindate=20090101&maxdate=20140806&mode=Phrase&index=news"
Specifying a range using mindate and maxdate (via the SetFilterRange call) is optional; you can remove the code handling them.
Study the Sphinx documentation to understand what each match mode means.
What Next? #
In another post, I will cover how to set up distributed search indices using Sphinx.
Author’s Sphinx Usage Background #
Sphinx was used as the search platform for the author’s product http://storywatch.in. Here is some basic info about the setup:
- A total of 4 million articles were indexed (around 20GB of data).
- The index was partitioned across 6 Ubuntu VMs (1 VCPU, 512MB RAM, and 20GB SSD).