Archive for October, 2009

Source now available at Google Code

October 26, 2009

I’ve uploaded the full source code for the Lucene Highlighter example to my project at Google Code. The code for the earlier posts will be there in the next few weeks.

You can browse the full source online or you can use Subversion to check out the code for a local copy. As an example, to check out the Lucene Highlighter code at the command line:

svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/

Lucene Highlighter HowTo

October 25, 2009

Background
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear. Each search result displays the title, the URL, and a text fragment containing your search terms in bold.

Consider what happens when you search for ‘Apache’ at Google. Your results would include the Apache server, the Apache Software Foundation, the Apache helicopter, and Apache County. The contextual fragments displayed with each search result help you judge whether a result is an appropriate match and whether you need to add more search terms to narrow the result space. Search would not be as user friendly as it is today without these fragments.

This post covers version 2.4.1 of Apache Lucene, the popular open source search engine library written in Java. It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result. The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar. The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.

If you are not familiar with Lucene, you can think of it as a library which provides

  • a way to create a search index from multiple text items
  • a way to quickly search the index and return the best matches.

A more thorough explanation of Lucene can be found at the Apache Lucene FAQ.
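If you have never touched Lucene, a minimal sketch against the 2.4.1 API may help make the rest of this post concrete. The field name and sample text here are made up for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class LuceneHelloWorld {
	public static void main(String[] args) throws Exception {
		// 1. create a search index from a text item
		RAMDirectory directory = new RAMDirectory();
		IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(),
				true, IndexWriter.MaxFieldLength.UNLIMITED);
		Document doc = new Document();
		doc.add(new Field("contents", "transactions wait in the write-behind queue",
				Field.Store.YES, Field.Index.ANALYZED));
		writer.addDocument(doc);
		writer.close();

		// 2. search the index and return the best matches
		IndexSearcher searcher = new IndexSearcher(directory);
		Query query = new QueryParser("contents", new StandardAnalyzer())
				.parse("queue");
		TopDocs hits = searcher.search(query, 10);
		System.out.println("matches: " + hits.totalHits);
	}
}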

As an example of what the Lucene Highlighter can do, here is what appears when I search for ‘queue’ in an index of PDF documents.

e14510.pdf
Oracle Coherence Getting Started Guide
of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred…
duration of an item within the queue is configurable, and is referred to as the Write-Behind Delay. When data changes, it is added to the write-behind queue (if it is not already in the queue), and the queue entry is set to ripen after the configured Write-Behind Delay has passed…

The Steps
First, before you can display highlighted fragments with each search result, the text to highlight must be available. Shown below is a snippet of indexing code. We are storing the text that will be used to generate the fragment in the contents field.

Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.COMPRESS, 
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("title", bookTitle, Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filepath", f.getCanonicalPath(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filename", f.getName(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  

The values Field.Store.COMPRESS and Field.Store.YES tell Lucene to store the field in the index for later retrieval with a doc.get() invocation.

Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index. Lucene automatically uncompresses it when it is retrieved.
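In other words, at search time a stored field comes back as a plain String either way. A small sketch, assuming an IndexSearcher named searcher and a matching document id docId:

Document doc = searcher.doc(docId);
String title = doc.get("title");        // stored as-is with Field.Store.YES
String contents = doc.get("contents");  // stored with Field.Store.COMPRESS,
                                        // decompressed transparently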

Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents. An example of an Analyzer is StandardAnalyzer. One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, …) from the text being indexed.

Note: You should use the same analyzer type (like StandardAnalyzer) for your indexing and searching operations; otherwise, you will not get the results you are seeking.
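To see what an Analyzer actually does to your text, you can walk the TokenStream it produces. A minimal sketch against the 2.4.1 token API, with a made-up sample sentence:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
	public static void main(String[] args) throws Exception {
		TokenStream stream = new StandardAnalyzer().tokenStream("contents",
				new StringReader("The entry is added to the queue"));
		Token token = new Token();
		while ((token = stream.next(token)) != null) {
			// lowercased, with the stopwords dropped: entry added queue
			System.out.println(token.term());
		}
	}
}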

The last part of the indexing side is the term vectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”

For the Highlighter, term vectors need to be available, and you have a choice: compute and store them with the index at index time, or compute them on demand when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time. A sketch of both options appears below.
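Both routes go through the TokenSources helper in the highlighter contrib. A sketch, assuming an open IndexReader named reader, a document id docId, and the stored contents and analyzer from earlier:

// term vectors stored at index time (WITH_POSITIONS_OFFSETS):
// read them back from the index, no re-analysis needed
TokenStream fromIndex = TokenSources.getTokenStream(reader, docId, "contents");

// term vectors not stored: re-analyze the stored text at search time
TokenStream reAnalyzed = TokenSources.getTokenStream("contents", contents,
                          analyzer);

// or let TokenSources decide: term vectors if present, else re-analyze
TokenStream either = TokenSources.getAnyTokenStream(reader, docId,
                          "contents", analyzer);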

With the index ready for presenting contextual fragments, let’s move on to generating them while processing a search request. Below is a typical “Hello World” type search block.

QueryParser qp = new QueryParser("contents", analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i < hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents = doc.get("contents");

	// hlu is an instance of the helper class whose fragment-generating
	// code is shown next
	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
                   query, "contents", contents, 5, 100);
}

Starting at the top, we create a query based on the user supplied search string, searchInput, using the QueryParser. Lucene supports a sophisticated query language and QueryParser simplifies transforming the supplied string to a query object. Next, we get the top 10 results matching the query. This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.
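For a taste of that query language, here are a few strings the parser above would accept, continuing from the snippet (the terms are illustrative):

Query q1 = qp.parse("queue");                       // single term
Query q2 = qp.parse("queue AND coherence");         // boolean operators
Query q3 = qp.parse("\"write-behind delay\"");      // exact phrase
Query q4 = qp.parse("title:coherence queue");       // term in another field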

Here is the code to generate the fragments:

TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents,
                      analyzer);
CachingTokenFilter cachedStream = new CachingTokenFilter(stream);
SpanScorer scorer = new SpanScorer(query, fieldName, cachedStream);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);

Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
// the SpanScorer constructor consumes the stream, so rewind the cached
// copy before handing it to the Highlighter
cachedStream.reset();
String[] fragments = highlighter.getBestFragments(cachedStream, fieldContents, 5);

First we obtain the TokenStream. The overload shown above re-analyzes the stored text, so it works even if term vectors were not stored in the index at index time; the other TokenSources overloads, shown earlier, read stored term vectors back instead.

Next come the SpanScorer and SimpleSpanFragmenter. Together they break the contents into roughly 100-character fragments and rank them by relevance. You can pair SpanScorer with SimpleSpanFragmenter, as here, or QueryScorer with SimpleFragmenter, as in the sketch below. The full details can be found in the Javadocs.
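For comparison, a sketch of the QueryScorer and SimpleFragmenter pairing; it ranks fragments by the query terms they contain, without the position awareness the span-based pair brings to phrase and span queries:

QueryScorer qScorer = new QueryScorer(query, fieldName);
Highlighter alt = new Highlighter(qScorer);
alt.setTextFragmenter(new SimpleFragmenter(100)); // ~100 character fragments
String[] altFragments = alt.getBestFragments(
		TokenSources.getTokenStream(fieldName, fieldContents, analyzer),
		fieldContents, 5);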

Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, thus Highlighter did not look at the full text to generate the fragments. This was not good because the match I was seeking was near the end of the contents.
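Continuing from the snippet above, the call is one line:

// let the Highlighter scan the whole document when picking fragments;
// the default cut-off is too small for book-length contents
highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);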

Finally, we tell the Highlighter to return the best 5 fragments.

The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java.

You can also find examples of using Highlighter in the Lucene SVN Repository, specifically HighlighterTest.java

As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.


Extracting content from XHTML using XPATH and dom4j

October 17, 2009

If you need to read, write, navigate, create, or modify XML documents, take a look at dom4j. After browsing the dom4j cookbook and quick start guide, you would think it is trivial to extract content from an XML document using XPATH. Consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title/>
</head>
<body>
	<div>
		<p>Paragraph 1</p>
	</div>
	<div>
		<p>Paragraph 2</p>
	</div>
	<div>
		<p>Paragraph 3</p>
	</div>
</body>
</html>

Listing 1: Sample XML

You would think the XPATH to extract the contents of each paragraph would be

String xpathExpr = "//div/p";

If you try it out, you will see there are no matches. The gotcha is the namespace

xmlns="http://www.w3.org/1999/xhtml"

If that namespace were not specified, the XPATH expression above would work. To see the gotcha in action, here is a minimal sketch of the naive dom4j calls coming up empty (assuming content holds the XML from Listing 1):
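// the un-prefixed XPATH looks for <div> and <p> in no namespace, but
// every element in Listing 1 lives in the XHTML namespace
Document document = DocumentHelper.parseText(content);
List list = document.selectNodes("//div/p");
System.out.println(list.size());  // prints 0 -- no matches

The solution is to use an alternate set of API calls than what is shown in the standard XPATH examples in the dom4j documentation.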

package com.hrycan.blog.xml;

import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.dom4j.XPath;

public class XPATHProcessor {

	private Map<String, String>  namespaceURIMap;
	
	public List<Node> extract(String content, String xpathExpr) throws DocumentException {
		Document document = DocumentHelper.parseText(content);		
		XPath path = DocumentHelper.createXPath(xpathExpr);
		path.setNamespaceURIs(namespaceURIMap);
		
		List<Node> list = path.selectNodes(document.getRootElement());
		return list;
	}

	public void setNamespaceURIMap(Map<String, String> namespaceURIMap) {
		this.namespaceURIMap = namespaceURIMap;
	}
}

Here we use the DocumentHelper class to create an XPath object and then set its namespace URI map, instead of using the boilerplate document.selectNodes(xpathExpr) call.

Here is the JUnit test showing how it would be used with the content of Listing 1 and the corresponding XPATH expression qualified with the namespace prefix.

package com.hrycan.blog.xml;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class XPATHProcessorTest {
	private XPATHProcessor xpathProcessor;
	
	@Test
	public void testExtract() throws DocumentException {
		xpathProcessor = new XPATHProcessor();
		Map<String, String>  namespaceURIMap = new HashMap<String, String> ();
		namespaceURIMap.put("html", "http://www.w3.org/1999/xhtml");
		xpathProcessor.setNamespaceURIMap(namespaceURIMap);
		
		String content =
			"<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
			"<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
			"<head><title/></head>" +
			"<body>" +
			"<div><p>Paragraph 1</p></div>" +
			"<div><p>Paragraph 2</p></div>" +
			"<div><p>Paragraph 3</p></div>" +
			"</body></html>";  // the XML from Listing 1
		String xpathExpr = "//html:div/html:p";
		
		List<Node> list = xpathProcessor.extract(content, xpathExpr);
		assertTrue(list.size() == 3);
	}	
}