Home > Java > Paginating Lucene Search Results

Paginating Lucene Search Results

Most search results are returned to users in a paginated form. In other words, only X results are shown to the user on a single page and the user has a way to navigate to the next X results.

Let’s use Google as an example. By default, you are shown the first 10 best results matching your query. At the bottom of the page, there are numbered links representing the corresponding pages of the search results. Clicking on “3” gives you the 3rd page of results. With a page size of 10, the results shown are 21 -30.

Searching for “lucene” at Google, it responds with
“Results 1 – 10 of about 2,440,000 for lucene. (0.33 seconds)”

Imagine if Google did not use pagination and instead returned everything to you at once on a single page! That would be a very large page and take a very long time to load.

If you use Lucene, you too can present your search results in a paginated manner even though Lucene does not provide a direct way to do it in their API (2.4.1). Thanks to how Lucene is designed, it is very easy to implement.

In Lucene, when you perform a search you do not retrieve the actual documents immediately, instead you get an array of ranked pointers to the document matches. Specifically, the TopDocs object returned contains an array of ScoreDoc objects which contain the document IDs of the matching documents. They are ordered by decreasing relevancy to the user supplied search term, so the best matches are at the front. It is the call to the IndexSearcher object, specifically IndexSearcher.doc(docID), which pulls the document into memory. Lucene’s lazy loading of documents is very beneficial for pagination.

As we can see in the Lucene FAQ, the Lucene developers recommended approach to pagination is re-executing the search and navigating to the correct position in the ScoreDoc array.

As a result, pagination becomes a problem of finding the correct start position and end position in the ScoreDoc array.
Here is a paginated search:

Query query = qp.parse(searchTerm);
TopDocs hits = searcher.search(query, maxNumberOfResults);
ArrayLocation arrayLocation = paginator.calculateArrayLocation(hits.scoreDocs.length, pageNumber, pageSize);

for (int i = arrayLocation.getStart() - 1; i < arrayLocation.getEnd(); i++) {
	int docId = hits.scoreDocs[i].doc;

	//load the document
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents =  doc.get(searchField);

}

Full source: SearchServiceImpl.java

Here are the details of calculating the locations in the ScoreDocs array:

public class Paginator {

	public ArrayLocation calculateArrayLocation(int totalHits, int pageNumber,
		int pageSize) {
		ArrayLocation al = new ArrayLocation();

		if (totalHits < 1 || pageNumber < 1 || pageSize < 1) {
			al.setStart(0);
			al.setEnd(0);
			return al;
		}

		int start= 1 + (pageNumber -1) * pageSize;
		int end = Math.min(pageNumber * pageSize, totalHits);
		if (start > end) {
			start = Math.max(1, end - pageSize);
		}

		al.setStart(start);
		al.setEnd(end);
		return al;
	}
}

Full source: Paginator.java

The majority of the time, your users will find what they need on the first page of search results since the matches most relevant to their search are returned at the top of the list. I think most people don’t look beyond the first few pages and instead revise their search terms if they don’t immediately see what they need.

The benefits of pagination include:

  • lower response time since fewer documents are processed on the server side
  • lower memory use since fewer documents are loaded
  • lower network use since the amount of data sent over the network to the client is reduced

The end result is a better user experience.

Here are some numbers to give you a general idea of the impact of pagination. It shows the same search being performed with and without pagination and its effect on response time. MaxResults is the argument passed to the IndexSearcher.search(query, maxresults) method which bounds the number of results returned.

Paginated MaxResults Number of Matches Response Time
Yes 500 500 5ms
No 500 500 186ms
Yes 1000 1000 5ms
No 1000 1000 415ms
Yes 5000 2110 5ms
No 5000 2110 1007ms
Advertisements
Categories: Java Tags: ,
  1. Burk
    March 29, 2010 at 11:59 pm

    Nick , ur stats are spot on & information regarding the pagination technique is worth appreciating . I would simply like to add value to the discussion by describing the pagination technique in a slightly different context.
    If u suppose the search results as hits , then these hits provide an iterator over the results, caching a fixed sized window of 200 hits (roughly – not sure of this number). When all of the docs in cache have been iterated over, the search is performed again, and the cache is populated with the next window of hits from the complete list of hits, if there are more hits available. In your case of 1M+ hits, the query would be re-executed 1M+/200 = 5K+ times!
    U can just chk out the below mentioned link and get to discover this fundamental
    http://search.lucidimagination.com/search/document/CDRG_ch06_6.1

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: