Friday, March 17, 2006

Getting Started With DOM, XOM, DOM4J by Parsing an RSS Feed - An Experience Review

Recently, I looked for a way to get information about some particular blog entries of mine on blogger.com. Blogger used to offer an XML-RPC API. They even designed a version 2 of the XML-RPC API, which does not seem to have ever been put into production; or maybe I just did not manage to make it work. I had no problem making v1 work, however. I used Apache XML-RPC v2, which was very simple to use. Unfortunately, the information accessible through the Blogger XML-RPC API was incomplete for my needs. Furthermore, this API will very likely disappear soon, as it has been deprecated since 2002.

Blogger wants you to use their Atom API. It is not XML-RPC anymore, so you have to do the parsing by hand.

The DOM Experience

I thought, "No big deal, I will use DOM for it." I did not need performance and wanted a quick way to solve my problem; plus, DOM does not require any extra libraries. Regular DOM was easy to use until I got frustrated at not being able to easily get the full text of the <content> element, which sometimes contains XML. I did not want to hand-code a method to do that, as I thought it should be done by the XML library.
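
For reference, here is a minimal sketch of what such a hand-coded method could look like with only the JDK: an identity transform that serializes a DOM node, children included. The class name DomNodeToXml, the helper names toXml and parse, and the sample entry are made up for illustration; this is not code I used on the actual feed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class DomNodeToXml {

    // Serializes a DOM node and its children with an identity transform.
    static String toXml(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header since we only want a fragment.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample entry whose <content> holds markup rather than plain text.
        String xml = "<entry><content type=\"xhtml\"><p>Hello <b>world</b></p></content></entry>";
        Document doc = parse(xml);
        Node content = doc.getElementsByTagName("content").item(0);
        System.out.println(toXml(content));
    }
}
```

It works, but it is a fair amount of ceremony compared to a single toXML()-style call, which is why I went looking at other libraries.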

The XOM Experience

I had previously heard of XOM, a simple, efficient XML parser with a well-designed API. I looked at the API and found a toXML() method that returns the node content as XML (children included), which sounded good. I saw there was even XPath support and thought, great, it will simplify my code a bit. I would get the blog entries by just querying for "/feed/entry". No luck: it did not work and returned 0 results. So I looked for mistakes in my code and did not find any obvious ones. I tried other queries like "//feed/entry" or "//entry", with the same outcome: not the right results. There must have been something wrong in my code, or maybe the XPath engine in XOM needs particular settings to deal with RSS feeds (they contain various xmlns declarations). The point is that I got frustrated: it was supposed to be very simple, and in reality it was not!

The DOM4J Experience

I had used Dom4j before, just once, to build XML, not to parse it. I had relatively good memories of that experience, so I decided to try it out on my problem. At first I found the Dom4j API a bit confusing, as there are so many methods on the most-used classes. This is because Dom4j is DOM compatible. But I quickly understood the logic of it and found some very useful methods, namely Element.elements(name) to get all child elements by name. Of course, it has an asXML() method like XOM's toXML(). There is also XPath support.
I tried XPath on the Blogger RSS feed without success again. There really must be a trick to get it to recognize RSS. But with the elements("entry") method, I very quickly got the same result with not much more code, and it worked.
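
The trick is most likely namespaces, as the first comment below also suggests: in XPath 1.0 an unprefixed name such as feed only matches elements in no namespace, while Atom elements live in a default namespace. Here is a small sketch of that behavior using the JDK's own XPath engine rather than XOM or Dom4j, so the exact API differs from theirs. The class name AtomXPathDemo, the sample feed, and the atom prefix are made up for illustration, and the namespace URI is the one quoted in the comment, which I am assuming applies here.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class AtomXPathDemo {

    // Assumed namespace URI, taken from the comment below.
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Made-up two-entry feed using a default namespace, like a real Atom feed.
    static final String FEED =
        "<feed xmlns=\"" + ATOM_NS + "\">"
        + "<entry><title>First</title></entry>"
        + "<entry><title>Second</title></entry>"
        + "</feed>";

    // Counts the nodes selected by an XPath 1.0 expression.
    static int count(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // keep namespace information in the DOM
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the (arbitrary) prefix "atom" to the Atom namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "atom".equals(prefix) ? ATOM_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return ATOM_NS.equals(uri) ? "atom" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return ATOM_NS.equals(uri)
                    ? Collections.singletonList("atom").iterator()
                    : Collections.<String>emptyList().iterator();
            }
        });
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names: only match elements in no namespace.
        System.out.println(count(FEED, "/feed/entry"));
        // Prefixed names bound to the Atom namespace.
        System.out.println(count(FEED, "/atom:feed/atom:entry"));
        // Ignoring namespaces entirely with local-name().
        System.out.println(count(FEED, "/*[local-name()='feed']/*[local-name()='entry']"));
    }
}
```

If this is indeed the cause, the equivalent fix in XOM or Dom4j would be to register a prefix for the feed's namespace with their XPath support; I have not verified that against the Blogger feed.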

So DOM vs. XOM vs. DOM4J = 0 - 0 - 1

Example Code:

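import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

// Fragment from inside a method. It assumes "response" holds the feed
// (for example an InputStream) and "LOG" is a logger with isDebugEnabled();
// neither is defined here.
SAXReader reader = new SAXReader();
Document doc = reader.read(response);
Collection posts = new ArrayList();

// Direct <entry> children of the root <feed> element.
List entries = doc.getRootElement().elements("entry");
if (LOG.isDebugEnabled())
{
    LOG.debug("found " + entries.size() + " entries");
}
for (int i = 0; i < entries.size(); i++)
{
    Element entry = (Element) entries.get(i);
    // One map per post: child element name -> value.
    Map m = new HashMap();
    for (Iterator it = entry.elementIterator(); it.hasNext();)
    {
        Element detail = (Element) it.next();
        String name = detail.getName();
        if (name.equals("link"))
        {
            // The URL is in the href attribute; attributeValue() returns
            // null instead of throwing when the attribute is missing.
            m.put("link", detail.attributeValue("href"));
        }
        else if (name.equals("content"))
        {
            // Keep the content as XML, since it may contain markup.
            m.put("content", detail.asXML());
        }
        else
        {
            m.put(name, detail.getTextTrim());
        }
    }

    posts.add(m);
    if (LOG.isDebugEnabled())
    {
        LOG.debug("found=" + m.get("title") + ", url=" + m.get("link"));
    }
}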

4 comments:

  1. From the sound of it you might be better off with XOM. It is really meant for processing XML, whereas I believe DOM4J is meant to create DOM (xml) from HTML.

    I have been using Nux, a seemingly nice XOM framework. Easy examples.

    I suggest looking at the first example, extracting titles from Tim Bray's blog using XOM. It sounds like your problem is missing namespaces (and the resulting node prefixes).

    Looking at the Nux example, you must put:
    declare namespace atom = "http://www.w3.org/2005/Atom"; in the query.

    Here's a little test in XOM, try: /*:feed/*:entry

    This will ignore the namespaces and simply return all the nodes, regardless of the namespace prefix.

    Maybe that will help... I am using Nux with TagSoup to get info from HTML pages. Be forewarned, TagSoup converts the doc to the xhtml namespace automatically -- and if you want your xpath queries to work, this is critical knowledge. They don't mention this in the example... using the wildcards instead.

  2. Thanks for your detailed reply. DOM4J is not for creating XML from HTML. It actually shares the same roots as XOM: JDOM.

    I will try your trick when I have time, to see whether it was a namespace-related problem. But I think you are right on this.

  3. Hi,
    I am going to start working with Nux. I am writing a program but want to know how to compile and run it using Nux. I have installed it and added the class paths... Please help me with this.

  4. Sorry for coming late to the party -- I was just searching for some XOM reviews regarding xpath (since it does behave a bit differently from Dom4j, and I'm not sure which is right) and it sounds like in this case, XOM wasn't any worse than Dom4j, but it got a slightly more snarky treatment.

    Here's what I mean: You tried to use XPath in XOM, and it didn't work, so blah. Then you tried Dom4j's XPath, and that also gave you trouble -- but then you also tried a different method that's also available in XOM, and it worked in Dom4j so you stuck with it.

    Please let me know if I'm reading this incorrectly or unfairly. :)
