Updating Document Fields in Lucene

November 26, 2009

Lucene 2.4.1 provides a convenient method for updating a Document in your index: the updateDocument method of IndexWriter, shown below. But what do you do if you want to update only some of the Fields of an existing document?

public void updateDocument(Term term, Document doc)
                    throws CorruptIndexException, IOException

Lucene’s updateDocument operation is basically a delete and an insert wrapped into a single function. All documents matching the Term parameter are deleted from the Lucene index, and the supplied Document instance is then inserted into the index. Although Lucene allows multiple copies of the same document to exist in the index, the update operation does not insert a copy of the supplied document for every match. In other words, if your Term matches 5 documents in the index, then 5 documents are deleted and a single document is inserted in their place.
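
To make the delete-then-insert behavior concrete, here is a minimal plain-Java model of it. It does not use Lucene at all: the ToyIndex class, its list-of-maps storage, and the field names are hypothetical, chosen only to mirror the semantics described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index: a list of documents, each a map of
// field name to value. It models only the delete-then-insert semantics and
// is not how Lucene stores or updates documents internally.
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Delete every document whose field equals value, then insert exactly
    // one replacement, no matter how many documents were deleted.
    void updateDocument(String field, String value, Map<String, String> replacement) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(replacement);
    }

    int size() {
        return docs.size();
    }

    // Builds the scenario from the text: five documents match, one does not.
    static ToyIndex demo() {
        ToyIndex index = new ToyIndex();
        for (int i = 0; i < 5; i++) {
            index.addDocument(Map.of("category", "java", "title", "post " + i));
        }
        index.addDocument(Map.of("category", "perl", "title", "other"));
        index.updateDocument("category", "java",
                Map.of("category", "java", "title", "merged"));
        return index;
    }

    public static void main(String[] args) {
        // Six documents go in, five are deleted, one is inserted.
        System.out.println(demo().size()); // prints 2
    }
}
```

The model shows why a non-unique term is risky: every match disappears and only the one supplied document remains.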

As you can see, it is a very good idea for you to design your documents so they have a field that uniquely identifies them in the entire index. In the database world, this is called a primary key field.

At times, it is helpful to think of the Lucene index as a database having a single table and the Documents as rows. It is a good analogy when you frame it in terms of searching. Boolean Queries seem to fit this concept nicely.

However, there are many differences between a Lucene index and a database.

  • Lucene does not provide a way to enforce field uniqueness. It is up to you to ensure uniqueness in your own code.
  • Lucene does not require a predefined schema for the documents in the index. This means the documents in an index do not all need to have the same number of fields or use the same field names. As an example, some documents can have the fields (id, url, contents) and other documents can have the fields (productid, manufacturer, summary, review).
  • Fields can be repeated in a document. For example, a document can have 3 product review fields (productid, manufacturer, summary, review, review, review). We will revisit this later in the code example.
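
The last two points in the list above can be sketched with a small plain-Java model. ToyDocument and its methods are hypothetical and are not part of Lucene. The sketch assumes that, as with Lucene's Document.get, asking for a single value of a repeated field returns the first one added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a schema-free document: an ordered list of
// (name, value) pairs, so the same field name may appear more than once.
class ToyDocument {
    private final List<String[]> fields = new ArrayList<>();

    ToyDocument add(String name, String value) {
        fields.add(new String[] {name, value});
        return this;
    }

    // All values stored under a name, in insertion order.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (String[] f : fields) {
            if (f[0].equals(name)) {
                values.add(f[1]);
            }
        }
        return values;
    }

    // First value stored under a name, or null when the field is absent.
    String get(String name) {
        List<String> values = getValues(name);
        return values.isEmpty() ? null : values.get(0);
    }

    // A product document with three review fields.
    static ToyDocument product() {
        return new ToyDocument()
                .add("productid", "p-1")
                .add("manufacturer", "Acme")
                .add("summary", "A widget")
                .add("review", "great")
                .add("review", "average")
                .add("review", "poor");
    }

    // A web page document with a completely different set of fields.
    static ToyDocument page() {
        return new ToyDocument()
                .add("id", "1")
                .add("url", "http://example.com/")
                .add("contents", "hello");
    }
}
```

Both documents could sit in the same index, and nothing stops a second document with id 1 from being added next to the first.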

Lucene’s updateDocument method overwrites the document(s) matching the given Term with the given Document. This is a problem if you only want to update a few fields and keep the remainder.

In the scenario pictured below, you can uniquely identify a document in the index whose author field you would like to update. So you then call updateDocument and pass in the Term and a Document instance populated with the new author field value. The result is an updated author field and the loss of the 3 other fields previously stored – the title, publisher, and contents fields.

Visual of Lucene's update document method

What do you do when you need to update a subset of the fields in a document but cannot re-create the remaining fields? There can be many reasons for this dilemma. Perhaps you are unable to re-create the fields because the original text is not available or perhaps the operations to re-create the fields are very costly.

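One approach to resolve this dilemma is to search for the current document in the index, change the desired fields, and use the modified document as the input to the updateDocument call. Keep in mind that a document retrieved through a searcher contains only its stored fields, so fields that were indexed but not stored cannot be carried over this way. This idea is illustrated below. UpdateUtil.java contains the full source.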

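//hits holds the TopDocs from a TermQuery built on term;
//updateDoc is the Document carrying the new field values
int docId = hits.scoreDocs[0].doc;

//retrieve the old document (stored fields only)
Document doc = searcher.doc(docId);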

List<Field> replacementFields = updateDoc.getFields();
for (Field field : replacementFields) {
	String name = field.name();
	String currentValue = doc.get(name);
	if (currentValue != null) {
		//replacement field value
		
		//remove all occurrences of the old field
		doc.removeFields(name);

		//insert the replacement
		doc.add(field);
	} else {
		//new field
		doc.add(field);
	}
}

//write the old document to the index with the modifications
writer.updateDocument(term, doc);

Here we pass in a Document that can hold both replacement fields and additional fields for the document identified by a search that uses the term parameter as the basis for a TermQuery. First we obtain the list of Fields from the document parameter. If the matched document already has a field by that name, the field is considered a replacement; otherwise it is a new field to be added to the document.

Notice the method first removes all fields in the Document having the same name as the replacement prior to inserting the replacement field. As mentioned earlier, a Lucene document can have multiple fields with the same name.

visual of documents stored in a lucene index

Without the remove call, you would be adding another value for the field instead of replacing the existing value.
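
The difference between replacing and appending can be seen in a small plain-Java model of the merge step. It is only a sketch: FieldMergeSketch, the map-of-lists document representation, and the removeFirst flag are hypothetical and merely mirror the role that removeFields plays in the loop above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: a document is a map from field name to its list of values.
class FieldMergeSketch {

    // Copies doc and applies the replacement values. When removeFirst is true,
    // existing values under the same name are cleared before the new value is
    // added (a replace); when false, the new value is appended to the old ones.
    static Map<String, List<String>> merge(Map<String, List<String>> doc,
                                           Map<String, String> replacements,
                                           boolean removeFirst) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        doc.forEach((name, values) -> result.put(name, new ArrayList<>(values)));
        replacements.forEach((name, value) -> {
            List<String> values = result.computeIfAbsent(name, k -> new ArrayList<>());
            if (removeFirst) {
                values.clear();
            }
            values.add(value);
        });
        return result;
    }

    // A stored document with one title and one author.
    static Map<String, List<String>> sample() {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title", List.of("Lucene Notes"));
        doc.put("author", List.of("Old Author"));
        return doc;
    }
}
```

With removeFirst set to true, merging a new author into the sample leaves a single author value. With it set to false, the document ends up with both the old and the new author, which is the duplicate-field situation described above. The title is untouched in both cases.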

A great tool for viewing what is actually in your Lucene index is Luke, the Lucene Index Toolbox. It is a very helpful tool for answering “what if” questions as you read the Lucene API.

Out of the box, Lucene does not provide a way to update the individual fields of a document in the index. However, it is relatively easy to achieve this functionality by grouping together the available API calls.

You can browse the full source at Google Code and download a copy of the entire project via svn:
svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/
