Archive

Archive for the ‘Uncategorized’ Category

Source now available at Google Code

October 26, 2009 Leave a comment

I’ve uploaded the full source code for the Lucene Highlighter example to my project at Google Code. The code for the earlier posts will be there in the next few weeks.

You can browse the full source online or you can use subversion to checkout the code for a local copy. As an example, to checkout the lucene code at the command line:

svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/
Advertisements
Categories: Uncategorized

Lucene Highlighter HowTo

October 25, 2009 2 comments

Background
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear. Each search result displays the title, the URL, and a text fragment containing your search terms in bold.

Consider what happens when you search for ‘Apache’ at Google. Your results would include the Apache server, the Apache Software Foundation, the Apache Helicopter, and Apache County. The contextual fragments displayed with each search result helps you judge if a search result is an appropriate match and if you need to add additional search terms to narrow the search result space. Search would not be as user friendly as it is today without these fragments.

This post covers version 2.4.1 of Apache Lucene, the popular open source search engine library written in Java. It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result. The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar. The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.

If you are not familiar with Lucene, you can think of it as a library which provides

  • a way to create a search index from multiple text items
  • a way to quickly search the index and return the best matches.

A more thorough explanation of Lucene can be found at the Apache Lucene FAQ.

As an example of what the Lucene Highlighter can do, here is what appears when I search for ‘queue’ in an index of PDF documents.

e14510.pdf
Oracle Coherence Getting Started Guide
of the ways that Coherence can eliminate bottlenecks is to queue up transactions that have occurred…
duration of an item within the queue is configurable, and is referred to as the Write-Behind Delay. When data changes, it is added to the write-behind queue (if it is not already in the queue), and the queue entry is set to ripen after the configured Write-Behind Delay has passed…

The Steps
First, before you can display highlighted fragments with each search result, the text to highlight must be available. Shown below is a snippet of indexing code. We are storing the text that will be used to generate the fragment in the contents field.

Document doc = new Document();
doc.add(new Field("contents", contents, Field.Store.COMPRESS, 
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("title", bookTitle, Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filepath", f.getCanonicalPath(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
doc.add(new Field("filename", f.getName(), Field.Store.YES, 
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);  

The values Field.Store.COMPRESS or Field.Store.Yes tell Lucene to store the the field in the index for later retrieval with a doc.get() invocation.

Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index. Lucene automatically uncompresses it when it is retrieved.

Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents. An example of an Analyzer is StandardAnalyzer. One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, …) from the text being indexed.

Note: You should use the same analyzer type (like StandardAnalyzer) for your indexing and searching operations otherwise you will not get the results you are seeking.

Last part of the indexing side is the TermVectors. From the Lucene Javadocs:
“A term vector is a list of the document’s terms and their number of occurrences in that document.”

For the Highlighter, TermVectors need to be available and you have a choice of either computing and storing them with the index at index time or computing them as you need them when the search is performed. Above, Field.TermVector.WITH_POSITIONS_OFFSETS indicates were are computing and storing them in the index at index time.

With the index ready for presenting contextual fragments, lets move on to generating them while processing a search request. Below is a typical “Hello World” type search block.

QueryParser qp = new QueryParser(“contents”, analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i < hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get("filename");
	String contents =  doc.get(“contents”);
	
	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer, 
                   query, “contents”, contents, 5, 100);
}

Starting at the top, we create a query based on the user supplied search string, searchInput, using the QueryParser. Lucene supports a sophisticated query language and QueryParser simplifies transforming the supplied string to a query object. Next, we get the top 10 results matching the query. This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.

Here is the code to generate the fragments:

TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents, 
                      analyzer);
SpanScorer scorer = new SpanScorer(query, fieldName,
				new CachingTokenFilter(stream));
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);
		
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);		
String[] fragments = highlighter.getBestFragments(stream, fieldContents, 5);

First we obtain the TokenStream. The call shown above assumes term vectors were not stored in the index at index time.

Next is the SpanScorer and SimpleSpanFragmenter. These work to break the contents into 100 character fragments and rank them by relevancy. You can use SpanScorer and SimpleSpanFragmenter or QueryScorer and SimpleFragmenter. The full details can be found in the Javadocs.

Note: when indexing large files, like the full contents of PDF manuals, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value. In my case, the default value specified by Lucene was too small, thus Highlighter did not look at the full text to generate the fragments. This was not good because the match I was seeking was near the end of the contents.

Finally, we tell the Highlighter to return the best 5 fragments.

The full code for this example can be downloaded from my Google Code project. The source file that makes use of the Highlighter is HighligherUtil.java

You can also find examples of using Highlighter in the Lucene SVN Repository, specifically HighlighterTest.java

As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.

Categories: Uncategorized Tags: , ,

Extracting content from XHTML using XPATH and dom4j

October 17, 2009 2 comments

If you need to read, write, navigate, create or modify XML documents, take a look at dom4j. Browsing the dom4j cookbook and quick start guide, it seems trivial to extract content from an XML document using XPATH. Consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title/>
</head>
<body>
	<div>
		<p>Paragraph 1</p>
	</div>
	<div>
		<p>Paragraph 2</p>
	</div>
	<div>
		<p>Paragraph 3</p>
	</div>
</body>
</html>

Listing 1: Sample XML

You would think the XPATH to extract the contents of each paragraph would be

String xpathExpr = "//div/p";

If you try it out, you will see there are no matches. The gotcha is the namespace

xmlns="http://www.w3.org/1999/xhtml"

If that namspace was not specified, then the above XPATH expression would work. The solution is to use an alternate set of API calls than what is shown in the standard XPATH examples in the dom4j documentation.

package com.hrycan.blog.xml;

import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.dom4j.XPath;

public class XPATHProcessor {

	private Map<String, String>  namespaceURIMap;
	
	public List<Node> extract(String content, String xpathExpr) throws DocumentException {
		Document document = DocumentHelper.parseText(content);		
		XPath path = DocumentHelper.createXPath(xpathExpr);
		path.setNamespaceURIs(namespaceURIMap);
		
		List<Node> list = path.selectNodes(document.getRootElement());
		return list;
	}

	public void setNamespaceURIMap(Map<String, String> namespaceURIMap) {
		this.namespaceURIMap = namespaceURIMap;
	}
}

Here we use the DocumentHelper object to create an XPath object and then set it’s namespace URIs map instead of using the boilerplate document.selectNodes(xpathExpr) call.

Here is the JUnit test showing how it would be used with the content of Listing1 and the corresponding XPATH expression using the namespace.

package com.hrycan.blog.xml;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class XPATHProcessorTest {
	private XPATHProcessor xpathProcessor;
	
	@Test
	public void testExtract() throws DocumentException {
		xpathProcessor = new XPATHProcessor();
		Map<String, String>  namespaceURIMap = new HashMap<String, String> ();
		namespaceURIMap.put("html", "http://www.w3.org/1999/xhtml");
		xpathProcessor.setNamespaceURIMap(namespaceURIMap);
		
		String content = //content of Listing1 shown above		
		String xpathExpr = "//html:div/html:p";
		
		List<Node> list = xpathProcessor.extract(content, xpathExpr);
		assertTrue(list.size() == 3);
	}	
}
Categories: Uncategorized Tags: , , , ,

Using Regex to Sanitize and Standardize Input Fields

September 9, 2009 Leave a comment

We’ve all seen forms in web applications which require a phone number. The challenge posed by this field is there are many different ways for a user to type a valid phone number. Some examples are:

  • 602-236-2619
  • (602)2368888
  • 602.236.8888
  • 602 236 8888
  • 6022368888

The problem is we do not want the application storing the phone number in multiple formats. We want the system to use a single format. Here are some of the strategies I’ve seen to address the problem:

  • Use 3 input fields.
    Instead of a single text field, use three. Each field is meant for a different part of the phone number. Some implementations of this idea automatically tab to the next field after you enter the correct number of digits in the current field. Others force the user to tab between fields.
  • Adjust the phone number as the user types.
    This strategy uses a single text field. To match the desired format, the application modifies the user’s input as they type in order to insert the dash or parenthesis in the correct place.
  • Hope for the best.
    This strategy using a single text field and hopes the user enters the phone number matching the format show next to the text field, such as (XXX) XXX-XXXX. The validation on the server side checks for this format and displays an error to the user if it does not match, even if it is a valid phone number.

All of these techniques can lead to a sub-optimal user experience and unhappy users.

So what is a better solution to the problem? A better approach is to accept the format the user gives and change it to your internal format on the server side if it is a valid phone number.

The following code exemplifies the idea above. It is for a 10 digit US phone number. First we use a regular expression to remove all the ‘filler’ characters often found in phone numbers. Then we check to see if we have 10 digits. Next, we use another regular express to extract the logical groups of numbers found in a 10 digit phone number, for instance the area code is the first 3 digits. Finally, we take those logical groups and drop them into the corresponding slots of our format string (the value of the desiredFormat variable shown below) to transform the phone number into our system’s format.

package com.hrycan.blog;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneFormatter {
	private String parsePhoneNumberRegex = "(\\d{3})(\\d{3})(\\d{4})";
	private String desiredFormat = "($1) $2-$3";

	private Pattern pattern = Pattern.compile(parsePhoneNumberRegex);

	/**
	 * Formats the supplied phone number into the desired format if
	 * it is a recognized 10-digit phone number, otherwise returns null.
	 * @param phoneNumber	phone number to be transformed
	 * @return				correctly formatted phone number, 
	 * 						otherwise null
	 */
	public String format(String phoneNumber) {
		String formattedNumber = null;

		if (phoneNumber != null) {
			//remove any non-digits
			String bareNumber = phoneNumber.replaceAll("[\\s.()-]", "");
			if (bareNumber.matches("\\d{10}")) {
				Matcher matcher = pattern.matcher(bareNumber);
				if (matcher.matches()) {
					//place the groups matched into the slots
					formattedNumber = matcher.replaceAll(desiredFormat);
				}
			}
		}

		return formattedNumber;
	}
}

Below is a JUnit 4.4 test showing the code in action.

package com.hrycan.blog;

import org.junit.Test;
import static org.junit.Assert.assertNotNull;
import static org.junit.Assert.assertNull;

public class PhoneFormatterTest {

	@Test
	public void testFormat() {
		
		PhoneFormatter formatter = new PhoneFormatter();
		String[] good = {"602-437-1616", 
				"(602)4371616",
				"602.437.1616",
				"602 437 1616",
				"6024371616"};
		
		String[] bad = {"4371616",
				"1.602.437.1616",
				"437-1616",
				"",
				null};
		
		for (String input : good) {
			String output = formatter.format(input);
			assertNotNull(input, output);
		}
		
		for (String input : bad) {
			String output = formatter.format(input);
			assertNull(input, output);
		}
	}
}

Of course, you may need to accept more than 10 digits in a phone number, you may want to accept an alphanumeric phone number, or you may want to use a different phone number format in your system. The same concepts demonstrated can fit those situations. For example, if you wanted the phone number format to become XXX-XXX-XXXX, you would change the value of the desiredFormat variable to:

private String desiredFormat = "$1-$2-$3";

If you want to make this into a generic utility, one of the things you would want to do is pass your ‘desiredFormat’ string to an overloaded format method.

This concept of sanitizing user supplied data and standardizing it to a common format is not limited to phone numbers. It can apply to any input field where there are multiple valid input formats, such as credit card numbers and social security numbers.

Let’s Get Started!

September 8, 2009 Leave a comment

The focus of this blog is software engineering – as you can tell from that catchy tagline at the top of the page. Some of the topics covered will include system architecture, application design, java standard edition and java enterprise edition, and web application security.

My goal is to present content which can help fellow software engineers solve problems. In my experience, teaching someone a skill helps both the student and the teacher.

Categories: Uncategorized