Updating Document Fields in Lucene

November 26, 2009

Lucene 2.4.1 provides a convenient method for updating a Document in your index, namely the updateDocument method of IndexWriter (shown below), but what do you do if you want to update the fields of an existing document?

public void updateDocument(Term term, Document doc)
                    throws CorruptIndexException, IOException

Lucene’s updateDocument operation is essentially a delete followed by an insert, wrapped into a single call. All documents matching the Term parameter are deleted from the index, and the supplied Document instance is then inserted. While Lucene allows multiple copies of the same document to exist in the index, the update operation does not insert one copy of the supplied document per match. In other words, if your Term matches 5 documents in the index, then those 5 documents are deleted and a single document is inserted in their place.

As you can see, it is a very good idea to design your documents so that each one has a field that uniquely identifies it across the entire index. In the database world, this is called a primary key.
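
As a minimal sketch of that idea, you might give every document an id field and always update through it. This assumes writer is an open IndexWriter and replacementDoc is the fully populated new version of the document; the field names and values are only illustrative:

//every document carries a field that uniquely identifies it
Document doc = new Document();
doc.add(new Field("id", "book-0042", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("title", "Some Book Title", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);

//later, replace whatever document carries that id with the new version
writer.updateDocument(new Term("id", "book-0042"), replacementDoc);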

At times, it is helpful to think of the Lucene index as a database with a single table and the Documents as its rows. The analogy works well when you frame it in terms of searching; BooleanQuery fits the concept nicely.

However, there are many differences between a Lucene index and a database.

  • Lucene does not provide a way to enforce field uniqueness. It is up to you to enforce it yourself.
  • Lucene does not require a predefined document schema for the documents in the index. This means all documents in the index do not need to have the same number of fields or use the same field names. As an example, some documents can have the fields (id, url, contents) and other documents can have the fields (productid, manufacturer, summary, review).
  • Fields can be repeated in a document. For example, a document can have 3 product review fields (productid, manufacturer, summary, review, review, review). We will revisit this later in the code example; a small sketch also follows this list.
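
For instance, a document with three review values could be built like this (field names and values are purely illustrative):

Document doc = new Document();
doc.add(new Field("productid", "12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("manufacturer", "Acme", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("summary", "Durable and inexpensive", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("review", "First review text", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("review", "Second review text", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("review", "Third review text", Field.Store.YES, Field.Index.ANALYZED));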

Lucene’s updateDocument method overwrites the document(s) matching the given Term with the given Document. This is a problem if you only want to update a few fields and keep the remainder.

In the scenario pictured below, you can uniquely identify a document in the index whose author field you would like to update. You then call updateDocument and pass in the Term and a Document instance populated with the new author field value. The result is an updated author field and the loss of the 3 other fields previously stored: the title, publisher, and contents fields.

[Figure: Lucene's updateDocument method]

What do you do when you need to update a subset of the fields in a document but cannot re-create the remaining fields? There can be many reasons for this dilemma. Perhaps you are unable to re-create the fields because the original text is not available or perhaps the operations to re-create the fields are very costly.

One approach to resolve this dilemma is to search for the current document in the index, change the desired fields, and use the modified document as the input to the updateDocument call. This idea is illustrated below. UpdateUtil.java contains the full source.
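
For instance, the lookup that produces the hits variable used below could be a simple TermQuery built from the same term, assuming searcher is an open IndexSearcher (a sketch, not the full UpdateUtil code):

//find the document carrying the unique term (for example, an id field)
Query idQuery = new TermQuery(term);
TopDocs hits = searcher.search(idQuery, 1);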

int docId = hits.scoreDocs[0].doc;
			
//retrieve the old document
Document doc = searcher.doc(docId);

List<Field> replacementFields = updateDoc.getFields();
for (Field field : replacementFields) {
	String name = field.name();
	String currentValue = doc.get(name);
	if (currentValue != null) {
		//replacement field value
		
		//remove all occurrences of the old field
		doc.removeFields(name);

		//insert the replacement
		doc.add(field);
	} else {
		//new field
		doc.add(field);
	}
}

//write the old document to the index with the modifications
writer.updateDocument(term, doc);

Here we pass in a Document that can contain both replacement fields and additional fields for the document identified by a search, using the term parameter as the basis for a TermQuery. First we obtain the list of Fields from the document parameter. If the matched document already has a field by that name, it is considered a replacement; otherwise it is a new field to be added to the document.

Notice the method first removes all fields in the Document having the same name as the replacement prior to inserting the replacement field. As mentioned earlier, a Lucene document can have multiple fields with the same name.

[Figure: documents stored in a Lucene index]

Without the remove call, you would be adding another value for the field instead of replacing the existing value.
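
To make that concrete, here is a small hypothetical illustration using the repeated review field from earlier:

//adding without removing keeps the old value alongside the new one
doc.add(new Field("review", "Newer review text", Field.Store.YES, Field.Index.ANALYZED));
String[] reviews = doc.getValues("review"); //contains the old value(s) and the new one

//removing first, then adding, leaves exactly one value
doc.removeFields("review");
doc.add(new Field("review", "Newer review text", Field.Store.YES, Field.Index.ANALYZED));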

A great tool for viewing what is actually in your Lucene index is Luke, the Lucene Index Toolbox. It is a very helpful tool for answering “what if” questions as you read the Lucene API.

Out of the box, Lucene does not provide a way to update the individual fields of a document in the index. However, it is relatively easy to achieve this functionality by grouping together the available API calls.

You can browse the full source at Google Code and download a copy of the entire project via svn:
svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/


Source now available at Google Code

October 26, 2009

I’ve uploaded the full source code for the Lucene Highlighter example to my project at Google Code. The code for the earlier posts will be there in the next few weeks.

You can browse the full source online or use Subversion to check out the code for a local copy. For example, to check out the Lucene code at the command line:

svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/

Lucene Highlighter HowTo

October 25, 2009

Background
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear. Each search result displays the title, the URL, and a text fragment containing your search terms in bold.

Consider what happens when you search for ‘Apache’ at Google. Your results would include the Apache web server, the Apache Software Foundation, the Apache helicopter, and Apache County. The contextual fragments displayed with each search result help you judge whether a result is an appropriate match and whether you need to add more search terms to narrow the result space. Search would not be as user friendly as it is today without these fragments.

This post covers version 2.4.1 of Apache Lucene, the popular open source search engine library written in Java. It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result. The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar. The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.

If you are not familiar with Lucene, you can think of it as a library which provides

  • a way to create a search index from multiple text items
  • a way to quickly search the index and return the best matches.

A more thorough explanation of Lucene can be found at the Apache Lucene FAQ.
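
As a very rough sketch of those two roles (the directory path and field values are placeholders), indexing and searching with Lucene 2.4.1 look something like this:

//create an index containing one document
IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
		new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("contents", "hello lucene", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

//search it and take the top 10 matches
IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory("/tmp/index"));
TopDocs hits = searcher.search(new TermQuery(new Term("contents", "hello")), 10);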

As an example of what the Lucene Highlighter can do, here is what appears when I search for ‘queue’ in an index of PDF documents.

e14510.pdf
Oracle Coherence Getting Started Guide
of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred…
duration of an item within the queue is configurable, and is referred to as the Write-Behind Delay. When data changes, it is added to the write-behind queue (if it is not already in the queue), and the queue entry is set to ripen after the configured Write-Behind Delay has passed…

The Steps
First, before you can display highlighted fragments with each search result, the text to highlight must be available. Shown below is a snippet of indexing code. We are storing the text that will be used to generate the fragment in the contents field.

Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.COMPRESS, 
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("title", bookTitle, Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filepath", f.getCanonicalPath(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filename", f.getName(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  

The values Field.Store.COMPRESS and Field.Store.YES tell Lucene to store the field in the index for later retrieval with a doc.get() invocation.

Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index. Lucene automatically uncompresses it when it is retrieved.

Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents. An example of an Analyzer is StandardAnalyzer. One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, …) from the text being indexed.

Note: You should use the same analyzer type (such as StandardAnalyzer) for your indexing and searching operations, otherwise you will not get the results you are seeking.

The last part of the indexing side is term vectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”

For the Highlighter, term vectors need to be available, and you have a choice of either computing and storing them at index time or computing them on demand when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time.

With the index ready for presenting contextual fragments, let’s move on to generating them while processing a search request. Below is a typical “Hello World” style search block.

QueryParser qp = new QueryParser("contents", analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i < hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents = doc.get("contents");
	
	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
                   query, "contents", contents, 5, 100);
}

Starting at the top, we create a query based on the user supplied search string, searchInput, using the QueryParser. Lucene supports a sophisticated query language and QueryParser simplifies transforming the supplied string to a query object. Next, we get the top 10 results matching the query. This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.

Here is the code to generate the fragments:

TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents, 
                      analyzer);
SpanScorer scorer = new SpanScorer(query, fieldName,
				new CachingTokenFilter(stream));
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);
		
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);		
String[] fragments = highlighter.getBestFragments(stream, fieldContents, 5);

First we obtain the TokenStream. The call shown above assumes term vectors were not stored in the index at index time.
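
If the term vectors were stored at index time (as they are in the indexing snippet earlier in this post), the stream can instead be rebuilt from them rather than re-analyzing the stored text. A hedged sketch, assuming reader is an IndexReader open on the same index:

//reuse stored term vectors when present, fall back to re-analysis otherwise
TokenStream stream = TokenSources.getAnyTokenStream(reader, docId, "contents", analyzer);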

Next come the SpanScorer and SimpleSpanFragmenter. Together they break the contents into 100-character fragments and rank them by relevance. You can use SpanScorer with SimpleSpanFragmenter, or QueryScorer with SimpleFragmenter. The full details can be found in the Javadocs.
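
For comparison, the QueryScorer and SimpleFragmenter pairing mentioned above would look roughly like this (a sketch built on the same query, fieldName, and fieldContents variables):

//QueryScorer ranks fragments by the query terms they contain;
//SimpleFragmenter simply cuts the text into fixed-size fragments
QueryScorer altScorer = new QueryScorer(query);
Highlighter altHighlighter = new Highlighter(altScorer);
altHighlighter.setTextFragmenter(new SimpleFragmenter(100));
String[] altFragments = altHighlighter.getBestFragments(
		TokenSources.getTokenStream(fieldName, fieldContents, analyzer),
		fieldContents, 5);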

Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, so the Highlighter did not look at the full text when generating the fragments. This was a problem because the match I was seeking was near the end of the contents.
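
In code, that adjustment is a single call on the Highlighter instance, using the method named above:

//look at the entire stored contents instead of only the default-sized prefix
highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);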

Finally, we tell the Highlighter to return the best 5 fragments.

The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java.

You can also find examples of using Highlighter in the Lucene SVN Repository, specifically HighlighterTest.java

As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.


Extracting content from XHTML using XPATH and dom4j

October 17, 2009

If you need to read, write, navigate, create or modify XML documents, take a look at dom4j. Browsing the dom4j cookbook and quick start guide, it seems trivial to extract content from an XML document using XPATH. Consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title/>
</head>
<body>
	<div>
		<p>Paragraph 1</p>
	</div>
	<div>
		<p>Paragraph 2</p>
	</div>
	<div>
		<p>Paragraph 3</p>
	</div>
</body>
</html>

Listing 1: Sample XML

You would think the XPATH expression to extract the contents of each paragraph would be:

String xpathExpr = "//div/p";

If you try it out, you will see there are no matches. The gotcha is the namespace:

xmlns="http://www.w3.org/1999/xhtml"

If that namespace was not specified, the above XPATH expression would work. The solution is to use a different set of API calls from those shown in the standard XPATH examples in the dom4j documentation.

package com.hrycan.blog.xml;

import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.dom4j.XPath;

public class XPATHProcessor {

	private Map<String, String>  namespaceURIMap;
	
	public List<Node> extract(String content, String xpathExpr) throws DocumentException {
		Document document = DocumentHelper.parseText(content);		
		XPath path = DocumentHelper.createXPath(xpathExpr);
		path.setNamespaceURIs(namespaceURIMap);
		
		List<Node> list = path.selectNodes(document.getRootElement());
		return list;
	}

	public void setNamespaceURIMap(Map<String, String> namespaceURIMap) {
		this.namespaceURIMap = namespaceURIMap;
	}
}

Here we use DocumentHelper to create an XPath object and then set its namespace URI map, instead of using the boilerplate document.selectNodes(xpathExpr) call.

Here is the JUnit test showing how it would be used with the content of Listing 1 and the corresponding XPATH expression qualified with the namespace prefix.

package com.hrycan.blog.xml;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class XPATHProcessorTest {
	private XPATHProcessor xpathProcessor;
	
	@Test
	public void testExtract() throws DocumentException {
		xpathProcessor = new XPATHProcessor();
		Map<String, String>  namespaceURIMap = new HashMap<String, String> ();
		namespaceURIMap.put("html", "http://www.w3.org/1999/xhtml");
		xpathProcessor.setNamespaceURIMap(namespaceURIMap);
		
		String content = ...; // XML content of Listing 1 (shown above)
		String xpathExpr = "//html:div/html:p";
		
		List<Node> list = xpathProcessor.extract(content, xpathExpr);
		assertTrue(list.size() == 3);
	}	
}

Using Regex to Sanitize and Standardize Input Fields

September 9, 2009

We’ve all seen forms in web applications that require a phone number. The challenge posed by this field is that there are many different ways for a user to type a valid phone number. Some examples are:

  • 602-236-2619
  • (602)2368888
  • 602.236.8888
  • 602 236 8888
  • 6022368888

The problem is that we do not want the application to store phone numbers in multiple formats. We want the system to use a single format. Here are some of the strategies I’ve seen for addressing the problem:

  • Use 3 input fields.
    Instead of a single text field, use three. Each field is meant for a different part of the phone number. Some implementations of this idea automatically tab to the next field after you enter the correct number of digits in the current field. Others force the user to tab between fields.
  • Adjust the phone number as the user types.
    This strategy uses a single text field. To match the desired format, the application modifies the user’s input as they type in order to insert the dash or parenthesis in the correct place.
  • Hope for the best.
    This strategy uses a single text field and hopes the user enters the phone number in the format shown next to the text field, such as (XXX) XXX-XXXX. The validation on the server side checks for this format and displays an error to the user if it does not match, even if the input is a valid phone number.

All of these techniques can lead to a sub-optimal user experience and unhappy users.

So what is a better solution? Accept whatever format the user gives and, if it is a valid phone number, convert it to your internal format on the server side.

The following code illustrates the idea for a 10-digit US phone number. First we use a regular expression to remove all the ‘filler’ characters often found in phone numbers. Then we check that we are left with 10 digits. Next, we use another regular expression to extract the logical groups of digits found in a 10-digit phone number; for instance, the area code is the first 3 digits. Finally, we drop those logical groups into the corresponding slots of our format string (the value of the desiredFormat variable shown below) to transform the phone number into our system’s format.

package com.hrycan.blog;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneFormatter {
	private String parsePhoneNumberRegex = "(\\d{3})(\\d{3})(\\d{4})";
	private String desiredFormat = "($1) $2-$3";

	private Pattern pattern = Pattern.compile(parsePhoneNumberRegex);

	/**
	 * Formats the supplied phone number into the desired format if
	 * it is a recognized 10-digit phone number, otherwise returns null.
	 * @param phoneNumber	phone number to be transformed
	 * @return				correctly formatted phone number, 
	 * 						otherwise null
	 */
	public String format(String phoneNumber) {
		String formattedNumber = null;

		if (phoneNumber != null) {
			//remove common filler characters: spaces, dots, parentheses, dashes
			String bareNumber = phoneNumber.replaceAll("[\\s.()-]", "");
			if (bareNumber.matches("\\d{10}")) {
				Matcher matcher = pattern.matcher(bareNumber);
				if (matcher.matches()) {
					//place the groups matched into the slots
					formattedNumber = matcher.replaceAll(desiredFormat);
				}
			}
		}

		return formattedNumber;
	}
}

Below is a JUnit 4.4 test showing the code in action.

package com.hrycan.blog;

import org.junit.Test;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

public class PhoneFormatterTest {

	@Test
	public void testFormat() {
		
		PhoneFormatter formatter = new PhoneFormatter();
		String[] good = {"602-437-1616", 
				"(602)4371616",
				"602.437.1616",
				"602 437 1616",
				"6024371616"};
		
		String[] bad = {"4371616",
				"1.602.437.1616",
				"437-1616",
				"",
				null};
		
		for (String input : good) {
			String output = formatter.format(input);
			assertNotNull(input, output);
		}
		
		for (String input : bad) {
			String output = formatter.format(input);
			assertNull(input, output);
		}
	}
}

Of course, you may need to accept more than 10 digits in a phone number, you may want to accept an alphanumeric phone number, or you may want to use a different phone number format in your system. The same concepts demonstrated here apply to those situations as well. For example, if you wanted the phone number format to become XXX-XXX-XXXX, you would change the value of the desiredFormat variable to:

private String desiredFormat = "$1-$2-$3";

If you want to make this into a generic utility, one of the things you would want to do is pass your ‘desiredFormat’ string to an overloaded format method, as sketched below.
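
One possible shape for that overload, sketched here rather than taken from the original class:

/**
 * Formats the supplied 10-digit phone number using a caller-supplied
 * replacement pattern such as "$1-$2-$3", otherwise returns null.
 */
public String format(String phoneNumber, String desiredFormat) {
	if (phoneNumber == null) {
		return null;
	}
	//remove common filler characters: spaces, dots, parentheses, dashes
	String bareNumber = phoneNumber.replaceAll("[\\s.()-]", "");
	if (!bareNumber.matches("\\d{10}")) {
		return null;
	}
	Matcher matcher = pattern.matcher(bareNumber);
	return matcher.matches() ? matcher.replaceAll(desiredFormat) : null;
}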

This concept of sanitizing user-supplied data and standardizing it to a common format is not limited to phone numbers. It applies to any input field with multiple valid input formats, such as credit card numbers and social security numbers.
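
As a hypothetical example of the same idea applied elsewhere, a social security number supplied in a String named ssn could be normalized with one regular expression to strip filler characters and another to re-insert the dashes:

//hypothetical SSN normalizer following the same sanitize-then-format pattern
String bare = ssn.replaceAll("[\\s-]", "");
String formattedSsn = bare.matches("\\d{9}")
		? bare.replaceAll("(\\d{3})(\\d{2})(\\d{4})", "$1-$2-$3")
		: null;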

Let’s Get Started!

September 8, 2009

The focus of this blog is software engineering, as you can tell from that catchy tagline at the top of the page. Some of the topics covered will include system architecture, application design, Java Standard Edition and Java Enterprise Edition, and web application security.

My goal is to present content which can help fellow software engineers solve problems. In my experience, teaching someone a skill helps both the student and the teacher.
