Lucene Highlighter HowTo

Background
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear. Each search result displays the title, the URL, and a text fragment containing your search terms in bold.

Consider what happens when you search for ‘Apache’ at Google. Your results would include the Apache server, the Apache Software Foundation, the Apache helicopter, and Apache County. The contextual fragments displayed with each search result help you judge whether a result is an appropriate match and whether you need to add search terms to narrow the results. Search would not be as user friendly as it is today without these fragments.

This post covers version 2.4.1 of Apache Lucene, the popular open source search engine library written in Java. It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result. The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar. The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.

If you are not familiar with Lucene, you can think of it as a library which provides

  • a way to create a search index from multiple text items
  • a way to quickly search the index and return the best matches.

A more thorough explanation of Lucene can be found at the Apache Lucene FAQ.

As an example of what the Lucene Highlighter can do, here is what appears when I search for ‘queue’ in an index of PDF documents.

e14510.pdf
Oracle Coherence Getting Started Guide
of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred…
duration of an item within the queue is configurable, and is referred to as the Write-Behind Delay. When data changes, it is added to the write-behind queue (if it is not already in the queue), and the queue entry is set to ripen after the configured Write-Behind Delay has passed…

The Steps
First, before you can display highlighted fragments with each search result, the text to highlight must be available. Shown below is a snippet of indexing code. We are storing the text that will be used to generate the fragment in the contents field.

Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.COMPRESS, 
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("title", bookTitle, Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filepath", f.getCanonicalPath(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filename", f.getName(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  

The values Field.Store.COMPRESS and Field.Store.YES tell Lucene to store the field in the index for later retrieval with a doc.get() invocation.

Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index. Lucene automatically uncompresses it when it is retrieved.

Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents. An example of an Analyzer is StandardAnalyzer. One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, …) from the text being indexed.

Note: You should use the same analyzer type (like StandardAnalyzer) for both your indexing and searching operations. Otherwise, the terms produced at search time may not match the terms stored in the index, and you will not get the results you are seeking.
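To make the note above concrete, here is a small plain-Java sketch of what an analyzer does conceptually. It is not Lucene code and not how StandardAnalyzer is implemented; the ToyAnalyzer class, its stop-word list, and its tokenizing rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer: optionally lower-cases the text,
// splits it on anything that is not a letter or digit, and drops stop words.
class ToyAnalyzer {
    // Illustrative stop-word list only; not Lucene's actual list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "as", "it", "the", "to"));

    private final boolean lowerCase;

    ToyAnalyzer(boolean lowerCase) {
        this.lowerCase = lowerCase;
    }

    List<String> analyze(String text) {
        String input = lowerCase ? text.toLowerCase(Locale.ROOT) : text;
        List<String> terms = new ArrayList<>();
        for (String token : input.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyAnalyzer indexTime = new ToyAnalyzer(true);
        ToyAnalyzer mismatched = new ToyAnalyzer(false);

        List<String> indexedTerms = indexTime.analyze("Add it to the Queue");
        System.out.println("indexed terms: " + indexedTerms);

        // Same analyzer at search time: the query term is found among the indexed terms.
        System.out.println(indexedTerms.containsAll(indexTime.analyze("Queue")));
        // Different analyzer at search time: the query term keeps its capital letter and misses.
        System.out.println(indexedTerms.containsAll(mismatched.analyze("Queue")));
    }
}
```

The point of the sketch is only that the index holds whatever terms the index-time analyzer produced, so the search-time analyzer has to produce terms in the same form.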

The last part of the indexing side is the term vectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”

For the Highlighter, term vectors need to be available, and you have a choice: compute and store them in the index at index time, or compute them on demand when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time.
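As a rough picture of what "positions and offsets" means, the plain-Java sketch below records, for each term, how often it occurs, its token position, and its start and end character offsets in the original text. This is only an illustration of the kind of data involved; ToyTermVector and Occurrence are hypothetical names, and this is not Lucene's storage format or API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the information a term vector with
// positions and offsets carries for a single document.
class ToyTermVector {
    // One occurrence of a term: token position plus character offsets.
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " [" + startOffset + "," + endOffset + ")";
        }
    }

    // Maps each lower-cased term to all of its occurrences, in document order.
    static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> vector = new LinkedHashMap<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            vector.computeIfAbsent(term, k -> new ArrayList<>())
                  .add(new Occurrence(position++, m.start(), m.end()));
        }
        return vector;
    }

    public static void main(String[] args) {
        String text = "the queue entry joins the queue";
        for (Map.Entry<String, List<Occurrence>> e : build(text).entrySet()) {
            System.out.println(e.getKey() + " x" + e.getValue().size() + " " + e.getValue());
        }
    }
}
```

With offsets on hand, a highlighter can locate each matching term in the stored text directly, which is why the two options above trade index size against work done at search time.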

With the index ready for presenting contextual fragments, let's move on to generating them while processing a search request. Below is a typical “Hello World” type search block.

// Parse the user's input against the "contents" field.
QueryParser qp = new QueryParser("contents", analyzer);
Query query = qp.parse(searchInput);
// Fetch the 10 best-matching documents.
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i < hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents = doc.get("contents");

	// Ask for up to 5 fragments of about 100 characters each.
	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
                   query, "contents", contents, 5, 100);
}

Starting at the top, we create a query from the user-supplied search string, searchInput, using the QueryParser. Lucene supports a sophisticated query language, and QueryParser simplifies transforming the supplied string into a query object. Next, we get the top 10 results matching the query. This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.

Here is the code to generate the fragments:

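// Re-analyze the stored text to obtain a token stream.
TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents, 
                      analyzer);
// SpanScorer may read the tokens (for example, for phrase queries), so wrap
// the stream in a CachingTokenFilter and reuse that same cached stream below.
CachingTokenFilter cachedStream = new CachingTokenFilter(stream);
SpanScorer scorer = new SpanScorer(query, fieldName, cachedStream);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);

Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
// Rewind the cached tokens before handing them to the Highlighter.
cachedStream.reset();
String[] fragments = highlighter.getBestFragments(cachedStream, fieldContents, 5);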

First we obtain the TokenStream. The call shown above re-analyzes the stored field text with the analyzer, so it works even if term vectors were not stored in the index at index time.

Next are the SpanScorer and SimpleSpanFragmenter. Together they break the contents into 100-character fragments and rank them by relevance. You can use either SpanScorer with SimpleSpanFragmenter or QueryScorer with SimpleFragmenter. The full details can be found in the Javadocs.
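To give a feel for the fragment-then-rank idea, here is a deliberately simplified plain-Java sketch. It is not Lucene's algorithm: ToyHighlighter is a hypothetical class that handles a single search term, scores a fragment by counting hits, and wraps each hit in <B> tags. The real scorers and fragmenters are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for "break into fragments, score, keep the best".
class ToyHighlighter {
    static List<String> bestFragments(String text, String term,
                                      int fragmentSize, int maxFragments) {
        Pattern hit = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);

        // 1. Break the text into fragments of roughly fragmentSize characters,
        //    cutting only between words.
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0
                    && current.length() + 1 + word.length() > fragmentSize) {
                fragments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            fragments.add(current.toString());
        }

        // 2. Score each fragment by its number of hits.
        final List<Integer> scores = new ArrayList<>();
        for (String fragment : fragments) {
            int count = 0;
            Matcher m = hit.matcher(fragment);
            while (m.find()) {
                count++;
            }
            scores.add(count);
        }

        // 3. Keep fragments with at least one hit, best score first.
        //    List.sort is stable, so ties stay in document order.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            if (scores.get(i) > 0) {
                order.add(i);
            }
        }
        order.sort((a, b) -> Integer.compare(scores.get(b), scores.get(a)));

        // 4. Mark up the hits in the winning fragments.
        List<String> best = new ArrayList<>();
        for (int i : order.subList(0, Math.min(maxFragments, order.size()))) {
            best.add(hit.matcher(fragments.get(i)).replaceAll("<B>$0</B>"));
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "one queue. filler filler filler. queue queue queue.";
        for (String fragment : bestFragments(text, "queue", 20, 2)) {
            System.out.println(fragment + "…");
        }
    }
}
```

The fragment size and fragment count play the same roles here as the 100 and 5 passed to the Lucene classes above.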

Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, thus Highlighter did not look at the full text to generate the fragments. This was not good because the match I was seeking was near the end of the contents.
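The effect described in the note can be mimicked in a few lines of plain Java. The sketch below is not Lucene code: AnalysisLimitDemo and foundWithinLimit are hypothetical names, and the limit of 51,200 characters is just an example value chosen for the demonstration.

```java
// Hypothetical demonstration of why a cap on analyzed characters can hide
// a match that sits near the end of a long document.
class AnalysisLimitDemo {
    // Returns true if term occurs within the first maxCharsToAnalyze characters of text.
    static boolean foundWithinLimit(String text, String term, int maxCharsToAnalyze) {
        int limit = Math.max(0, Math.min(text.length(), maxCharsToAnalyze));
        return text.substring(0, limit).contains(term);
    }

    public static void main(String[] args) {
        // Build a long document whose only match is at the very end.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 60000; i++) {
            sb.append('x');
        }
        sb.append(" queue");
        String text = sb.toString();

        // With a cap smaller than the document, the match is never seen.
        System.out.println(foundWithinLimit(text, "queue", 51200));
        // With the cap lifted, the whole text is examined.
        System.out.println(foundWithinLimit(text, "queue", Integer.MAX_VALUE));
    }
}
```

Raising the limit means more text is analyzed per result, so for very large documents a value between the default and Integer.MAX_VALUE may be the better compromise.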

Finally, we tell the Highlighter to return the best 5 fragments.

The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java

You can also find examples of using Highlighter in the Lucene SVN Repository, specifically HighlighterTest.java

As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.

  1. Dj
    November 6, 2009 at 1:32 am | #1

    Hi,

    I saw your code.
    Pretty impressive, I was indeed looking for Highlighter code.
    Thanx…
    I need some more information on indexing.
    How do I update a file which is already indexed?
    E.g. I have a manual that I have already indexed.
    After indexing I make some changes and want to re-index for the changes.

    Thanx in advance.

    -Dj

    • Nick Hrycan
      November 9, 2009 at 12:44 am | #2

      Updating a document is pretty easy: just use the updateDocument method of IndexWriter. This method performs a delete followed by an add as an atomic operation. It overwrites the entire document with what you are supplying to the method.

      When you added your document to your index, you probably had a field in it that could be considered its “primary key”, to borrow the database term for a row's unique identifier. The Term parameter given to updateDocument will be used to select all documents matching it, and the Document parameter will be used as the replacement.
