Extracting content from XHTML using XPATH and dom4j

If you need to read, write, navigate, create or modify XML documents, take a look at dom4j. Judging by the dom4j cookbook and quick start guide, extracting content from an XML document using XPATH seems trivial. Consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title/>
</head>
<body>
	<div>
		<p>Paragraph 1</p>
	</div>
	<div>
		<p>Paragraph 2</p>
	</div>
	<div>
		<p>Paragraph 3</p>
	</div>
</body>
</html>

Listing 1: Sample XML

You would think the XPATH to extract the contents of each paragraph would be

String xpathExpr = "//div/p";

If you try it out, you will see there are no matches. The gotcha is the namespace declaration on the root element:

xmlns="http://www.w3.org/1999/xhtml"

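If that namespace were not specified, the above XPATH expression would work. With it, every element in the document belongs to the XHTML namespace, whereas an unprefixed name in an XPath 1.0 expression only matches elements that are in no namespace. The solution is to use a different set of API calls from the ones shown in the standard XPATH examples in the dom4j documentation.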

package com.hrycan.blog.xml;

import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Node;
import org.dom4j.XPath;

public class XPATHProcessor {

	private Map<String, String>  namespaceURIMap;
	
	public List<Node> extract(String content, String xpathExpr) throws DocumentException {
		Document document = DocumentHelper.parseText(content);		
		XPath path = DocumentHelper.createXPath(xpathExpr);
		path.setNamespaceURIs(namespaceURIMap);
		
		List<Node> list = path.selectNodes(document.getRootElement());
		return list;
	}

	public void setNamespaceURIMap(Map<String, String> namespaceURIMap) {
		this.namespaceURIMap = namespaceURIMap;
	}
}

Here we use the DocumentHelper class to create an XPath object and then set its namespace URI map, instead of using the boilerplate document.selectNodes(xpathExpr) call.
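
To see the gotcha in isolation, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API rather than dom4j, so it runs without extra dependencies. The class name NamespaceGotchaDemo and the count helper are made up for illustration, and the document is a condensed copy of Listing 1. The NamespaceContext here plays roughly the same role as the map passed to setNamespaceURIs above.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: a JDK-only stand-in for the dom4j code in this post.
class NamespaceGotchaDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Condensed copy of Listing 1.
    static final String CONTENT =
        "<html xmlns=\"" + XHTML_NS + "\"><head><title/></head><body>"
        + "<div><p>Paragraph 1</p></div>"
        + "<div><p>Paragraph 2</p></div>"
        + "<div><p>Paragraph 3</p></div>"
        + "</body></html>";

    // Returns how many nodes the given expression selects in CONTENT.
    static int count(String xpathExpr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(CONTENT)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix "html" to the XHTML namespace URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }             // not needed here
                public Iterator<String> getPrefixes(String uri) { return null; } // not needed here
            });

            NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div/p"));           // unprefixed names: no namespace
        System.out.println(count("//html:div/html:p")); // prefixed names: XHTML namespace
    }
}
```

Run as written, the unprefixed expression should select nothing while the prefixed one selects the three paragraphs. The prefix html is arbitrary; only the URI it is bound to has to match the one declared in the document.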

Here is the JUnit test showing how XPATHProcessor would be used with the content of Listing 1 and the corresponding XPATH expression using the namespace prefix.

package com.hrycan.blog.xml;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.dom4j.DocumentException;
import org.dom4j.Node;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class XPATHProcessorTest {
	private XPATHProcessor xpathProcessor;
	
	@Test
	public void testExtract() throws DocumentException {
		xpathProcessor = new XPATHProcessor();
		Map<String, String>  namespaceURIMap = new HashMap<String, String> ();
		namespaceURIMap.put("html", "http://www.w3.org/1999/xhtml");
		xpathProcessor.setNamespaceURIMap(namespaceURIMap);
		
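		// Content of Listing 1 shown above, condensed onto a few lines
		String content = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
			+ "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
			+ "<head><title/></head>"
			+ "<body>"
			+ "<div><p>Paragraph 1</p></div>"
			+ "<div><p>Paragraph 2</p></div>"
			+ "<div><p>Paragraph 3</p></div>"
			+ "</body></html>";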
		String xpathExpr = "//html:div/html:p";
		
		List<Node> list = xpathProcessor.extract(content, xpathExpr);
		assertTrue(list.size() == 3);
	}	
}
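
If registering a prefix is not an option, another commonly used workaround is to match on local-name() and ignore the namespace altogether. The sketch below again uses the JDK's javax.xml.xpath API instead of dom4j so that it has no dependencies; LocalNameDemo, SAMPLE and paragraphTexts are hypothetical names. Since local-name() is a core XPath 1.0 function, the same expression should also work with dom4j, though that is not tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: namespace-agnostic matching with local-name().
class LocalNameDemo {

    // Condensed copy of Listing 1.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head><body>"
        + "<div><p>Paragraph 1</p></div>"
        + "<div><p>Paragraph 2</p></div>"
        + "<div><p>Paragraph 3</p></div>"
        + "</body></html>";

    // Matches div/p by local name only, whatever namespace the elements are in.
    static final String EXPR = "//*[local-name()='div']/*[local-name()='p']";

    // Returns the text of every matching paragraph, in document order.
    static List<String> paragraphTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(EXPR, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String text : paragraphTexts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

The trade-off is precision: this expression matches a p inside a div in any namespace, so the prefixed form is the safer choice when a document mixes vocabularies.
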
  1. November 23, 2009 at 2:37 pm

    You may also want to look at vtd-xml, which is a better all-around XML parser than Dom4J

  2. George Buzoianu
    November 24, 2009 at 6:12 am

    Nick, this article is so cool! 🙂 Thanks!
