Archive

Posts Tagged ‘lucene’

Paginating Lucene Search Results

February 10, 2010

Most search results are returned to users in a paginated form. In other words, only X results are shown to the user on a single page and the user has a way to navigate to the next X results.

Let’s use Google as an example. By default, you are shown the first 10 best results matching your query. At the bottom of the page, there are numbered links representing the corresponding pages of the search results. Clicking on “3” gives you the 3rd page of results. With a page size of 10, the results shown are 21–30.

Search Google for “lucene” and it responds with
“Results 1 – 10 of about 2,440,000 for lucene. (0.33 seconds)”

Imagine if Google did not use pagination and instead returned everything to you at once on a single page! That would be a very large page and take a very long time to load.

If you use Lucene, you too can present your search results in a paginated manner, even though Lucene’s API (2.4.1) does not provide a direct way to do it. Thanks to how Lucene is designed, it is very easy to implement.
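As a preview, here is a minimal sketch of the idea against the 2.4.1 API (the page and pageSize values are hypothetical, and searcher and query are assumed to be set up as in any standard Lucene search): ask for enough hits to cover the requested page, then render only that page’s slice of scoreDocs.

int page = 3;       //1-based page number requested by the user
int pageSize = 10;  //results per page

//fetch enough top hits to cover the requested page
TopDocs hits = searcher.search(query, page * pageSize);

//display only the slice belonging to this page
int start = (page - 1) * pageSize;
int end = Math.min(start + pageSize, hits.scoreDocs.length);
for (int i = start; i < end; i++) {
	Document doc = searcher.doc(hits.scoreDocs[i].doc);
	//render this hit as result (i + 1) of hits.totalHits
}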


Updating Document Fields in Lucene

November 26, 2009

Lucene 2.4.1 provides a convenient method for updating a Document in your index, namely the updateDocument method of IndexWriter (shown below). But what do you do if you want to update only some of the Fields of an existing document?

public void updateDocument(Term term, Document doc)
                    throws CorruptIndexException, IOException

Lucene’s updateDocument operation is basically delete and insert wrapped into a single function. All documents matching the Term parameter are deleted from the Lucene index, and the supplied Document instance is then inserted into the index. While Lucene allows multiple copies of the same document to exist in the index, the update operation does not insert a copy of the supplied document for every match. In other words, if your Term matches 5 documents in the index, then 5 documents are deleted and a single document is inserted in their place.

As you can see, it is a very good idea for you to design your documents so they have a field that uniquely identifies them in the entire index. In the database world, this is called a primary key field.
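For example, a minimal sketch (the "id" field name and its values are hypothetical): index an identifier field as NOT_ANALYZED so its exact value can be matched by a Term, then update through it.

Document doc = new Document();
doc.add(new Field("id", "book-42", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("author", "New Author", Field.Store.YES, Field.Index.ANALYZED));

//deletes every document whose id field contains "book-42",
//then inserts doc in their place
writer.updateDocument(new Term("id", "book-42"), doc);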

At times, it is helpful to think of the Lucene index as a database having a single table and the Documents as rows. It is a good analogy when you frame it in terms of searching. Boolean Queries seem to fit this concept nicely.

However, there are many differences between a Lucene index and a database.

  • Lucene does not provide a way to enforce field uniqueness. It is up to you to enforce it yourself.
  • Lucene does not require a predefined document schema for the documents in the index. This means all documents in the index do not need to have the same number of fields or use the same field names. As an example, some documents can have the fields (id, url, contents) and other documents can have the fields (productid, manufacturer, summary, review).
  • Fields can be repeated in a document. For example, a document can have 3 product review fields (productid, manufacturer, summary, review, review, review), as sketched just below. We will revisit this in the code example.
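To illustrate the last point, a quick sketch using the hypothetical product fields from the list above; each add() call appends another value rather than replacing the first:

Document product = new Document();
product.add(new Field("productid", "p-100", Field.Store.YES, Field.Index.NOT_ANALYZED));
product.add(new Field("review", "Great value.", Field.Store.YES, Field.Index.ANALYZED));
product.add(new Field("review", "Broke after a week.", Field.Store.YES, Field.Index.ANALYZED));

//product.getValues("review") now returns both review values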

Lucene’s updateDocument method overwrites the document(s) matching the given Term with the given Document. This is a problem if you only want to update a few fields and keep the remainder.

In the scenario pictured below, you can uniquely identify a document in the index whose author field you would like to update. So you then call updateDocument and pass in the Term and a Document instance populated with the new author field value. The result is an updated author field and the loss of the 3 other fields previously stored – the title, publisher, and contents fields.

[Figure: Lucene’s updateDocument method]

What do you do when you need to update a subset of the fields in a document but cannot re-create the remaining fields? There can be many reasons for this dilemma. Perhaps you are unable to re-create the fields because the original text is not available or perhaps the operations to re-create the fields are very costly.

One approach to resolve this dilemma is to search for the current document in the index, change the desired fields, and use the modified document as the input to the updateDocument call. This idea is illustrated below. UpdateUtil.java contains the full source.

int docId = hits.scoreDocs[0].doc;

//retrieve the old document
Document doc = searcher.doc(docId);

List<Field> replacementFields = updateDoc.getFields();
for (Field field : replacementFields) {
	String name = field.name();
	String currentValue = doc.get(name);
	if (currentValue != null) {
		//existing field: remove all occurrences of the old value
		doc.removeFields(name);

		//insert the replacement
		doc.add(field);
	} else {
		//new field
		doc.add(field);
	}
}

//write the old document to the index with the modifications
writer.updateDocument(term, doc);

Here we pass in a Document that can contain both replacement fields and additional fields for the document identified by a search, using the term parameter as the basis for a TermQuery. First we obtain the list of Fields from the document parameter. If the matched document already has a field by that name, it is treated as a replacement; otherwise, it is a new field to be added to the document.

Notice the method first removes all fields in the Document having the same name as the replacement prior to inserting the replacement field. As mentioned earlier, a Lucene document can have multiple fields with the same name.

[Figure: documents stored in a Lucene index]

Without the remove call, you would be adding another value for the field instead of replacing the existing value.

A great tool to view what is actually in your Lucene index is Luke, the Lucene Index Toolbox. It is a very helpful tool for answering “what if” questions when you read the Lucene API.

Out of the box, Lucene does not provide a way to update the individual fields of a document in the index. However, it is relatively easy to achieve this functionality by grouping together the available API calls.

You can browse the full source at Google Code and download a copy of the entire project via svn:
svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/


Lucene Highlighter HowTo

October 25, 2009

Background
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear. Each search result displays the title, the URL, and a text fragment containing your search terms in bold.

Consider what happens when you search for ‘Apache’ at Google. Your results would include the Apache server, the Apache Software Foundation, the Apache Helicopter, and Apache County. The contextual fragments displayed with each search result help you judge whether a result is an appropriate match and whether you need to add more search terms to narrow the result space. Search would not be as user friendly as it is today without these fragments.

This post covers version 2.4.1 of Apache Lucene, the popular open source search engine library written in Java. It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result. The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar. The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.

If you are not familiar with Lucene, you can think of it as a library which provides

  • a way to create a search index from multiple text items
  • a way to quickly search the index and return the best matches.

A more thorough explanation of Lucene can be found at the Apache Lucene FAQ.

As an example of what the Lucene Highlighter can do, here is what appears when I search for ‘queue’ in an index of PDF documents.

e14510.pdf
Oracle Coherence Getting Started Guide
of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred…
duration of an item within the queue is configurable, and is referred to as the Write-Behind Delay. When data changes, it is added to the write-behind queue (if it is not already in the queue), and the queue entry is set to ripen after the configured Write-Behind Delay has passed…

The Steps
First, before you can display highlighted fragments with each search result, the text to highlight must be available. Shown below is a snippet of indexing code. We are storing the text that will be used to generate the fragment in the contents field.

Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.COMPRESS, 
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("title", bookTitle, Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filepath", f.getCanonicalPath(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filename", f.getName(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  

The values Field.Store.COMPRESS and Field.Store.YES tell Lucene to store the field in the index for later retrieval with a doc.get() invocation.

Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index. Lucene automatically uncompresses it when it is retrieved.

Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents. An example of an Analyzer is StandardAnalyzer. One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, …) from the text being indexed.
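As a quick throwaway sketch (not part of the indexing code above), you can watch StandardAnalyzer lowercase terms and drop stopwords using the 2.4-era TokenStream API:

TokenStream ts = new StandardAnalyzer().tokenStream("contents",
		new StringReader("The Quick Fox"));
Token token;
while ((token = ts.next()) != null) {
	System.out.println(token.term());  //prints "quick", then "fox"
}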

Note: You should use the same analyzer type (like StandardAnalyzer) for your indexing and searching operations; otherwise, you will not get the results you are seeking.
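In practice, that means handing the same analyzer (or at least the same analyzer type) to both sides, roughly like this (directory is assumed to be your index Directory):

Analyzer analyzer = new StandardAnalyzer();

//indexing side
IndexWriter writer = new IndexWriter(directory, analyzer, true,
		IndexWriter.MaxFieldLength.UNLIMITED);

//searching side: the same analyzer parses the user's query text
QueryParser qp = new QueryParser("contents", analyzer);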

The last part of the indexing side is term vectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”

For the Highlighter, TermVectors need to be available, and you have a choice: compute and store them in the index at index time, or compute them on demand when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time.

With the index ready for presenting contextual fragments, let’s move on to generating them while processing a search request. Below is a typical “Hello World” type search block.

QueryParser qp = new QueryParser("contents", analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i < hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents = doc.get("contents");

	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
			query, "contents", contents, 5, 100);
}

Starting at the top, we create a query based on the user supplied search string, searchInput, using the QueryParser. Lucene supports a sophisticated query language and QueryParser simplifies transforming the supplied string to a query object. Next, we get the top 10 results matching the query. This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.

Here is the code to generate the fragments:

TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents,
		analyzer);
SpanScorer scorer = new SpanScorer(query, fieldName,
		new CachingTokenFilter(stream));
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);

Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
String[] fragments = highlighter.getBestFragments(stream, fieldContents, 5);

First we obtain the TokenStream. The call shown above assumes term vectors were not stored in the index at index time.
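If you did store term vectors (as the indexing snippet above does with WITH_POSITIONS_OFFSETS), one alternative, sketched here, is to rebuild the stream from the index rather than re-analyzing the stored text; getAnyTokenStream falls back to re-analysis when no term vector is present:

//rebuild the TokenStream from stored term vectors when available
TokenStream stream = TokenSources.getAnyTokenStream(
		searcher.getIndexReader(), docId, fieldName, analyzer);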

Next come the SpanScorer and SimpleSpanFragmenter. Together they break the contents into 100-character fragments and rank them by relevance. You can use SpanScorer with SimpleSpanFragmenter, or QueryScorer with SimpleFragmenter. The full details can be found in the Javadocs.

Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, thus Highlighter did not look at the full text to generate the fragments. This was not good because the match I was seeking was near the end of the contents.
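For example:

//let the Highlighter scan the entire stored text instead of just
//the default-sized prefix
highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);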

Finally, we tell the Highlighter to return the best 5 fragments.

The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java.

You can also find examples of using Highlighter in the Lucene SVN Repository, specifically HighlighterTest.java

As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.
