Friday, March 17, 2006

Getting Started With DOM, XOM, DOM4J by Parsing an RSS Feed - An Experience Review

Recently, I looked for a way to get information about some particular blog entries of mine on blogger.com. Blogger used to offer an XML-RPC API. They even designed a version 2 of the XML-RPC API, which does not seem to have ever been put into production. Or maybe I just did not manage to make it work. I had no problem making v1 work, however. I used Apache XML-RPC v2, which was very simple to use. Unfortunately, the information accessible through the Blogger XML-RPC API was incomplete for my needs. Furthermore, it is very likely that this API will disappear soon, as it has been deprecated since 2002.

Blogger wants you to use their Atom API instead. It is not XML-RPC anymore, so you have to parse the XML yourself.

The DOM Experience

I thought, "No big deal, I will use DOM for it." I didn't need performance and wanted a quick way to solve my problem; plus, DOM does not require a pile of extra libraries. Regular DOM was easy to use until I got frustrated by not being able to easily get the full text of the <content> element, which is sometimes XML itself. I did not want to hand-code a method for that, as I thought the XML library should do it.
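For reference, here is a minimal sketch of one way the <content> element could be serialized with nothing but the JDK, by running the node through an identity Transformer. This is not what I ended up using. The class and method names (DomContentSketch, asXml, firstContentXml) and the sample entry are made up for illustration, and the sketch assumes a feed without namespace handling.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class DomContentSketch {

    /** Serializes a DOM node, markup included, roughly like XOM's toXML() or dom4j's asXML(). */
    static String asXml(Node node) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    /** Parses an XML string and returns its first <content> element as XML, or null if there is none. */
    static String firstContentXml(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList contents = doc.getElementsByTagName("content");
            return contents.getLength() == 0 ? null : asXml(contents.item(0));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up entry whose <content> holds markup rather than plain text.
        String entry = "<entry><title>Hello</title>"
                + "<content><p>Some <b>bold</b> text</p></content></entry>";
        System.out.println(firstContentXml(entry));
    }
}
```

It works, but it is exactly the kind of boilerplate I was hoping the library would spare me.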

The XOM Experience

I had previously heard of XOM, a simple and efficient XML parser with a well-designed API. I looked at the API: there was a toXML() method that returns the node content as XML (children included), which sounded good. I saw there was even XPath support and thought, great, it will simplify my code a bit: I will get the blog entries by just querying for "/feed/entry". No luck: the query returned 0 results. I looked for mistakes in my code but did not find any obvious ones. I tried other queries such as "//feed/entry" and "//entry", with the same outcome: not the right results. There must have been something wrong in my code, or maybe the XPath engine in XOM needs particular settings to deal with RSS feeds (they contain various xmlns declarations). The point is that I got frustrated: it was supposed to be very simple, and in reality it was not!
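A likely explanation, offered as an assumption rather than something verified against the actual Blogger feed: in XPath 1.0 an unprefixed name such as "entry" only matches elements that are in no namespace, while the elements of a feed with a default xmlns declaration all live in that namespace. The sketch below reproduces the effect with the XPath engine that ships with the JDK. The class name AtomXPathSketch, the count helper, the "atom" prefix and the sample feed are all hypothetical, and the Atom 1.0 namespace URI is assumed (the feed may well declare a different one). As far as I know, XOM offers a similar hook through an overload of query that takes an XPathContext.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
final class AtomXPathSketch {

    // Assumption: the Atom 1.0 namespace; the real feed may declare another URI.
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Made-up feed with a default namespace declaration and two entries.
    static final String SAMPLE_FEED =
            "<feed xmlns=\"" + ATOM_NS + "\">"
            + "<entry><title>first</title></entry>"
            + "<entry><title>second</title></entry>"
            + "</feed>";

    /** Counts the nodes matched by an XPath expression, with the prefix "atom" bound to ATOM_NS. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the DOM carries no namespace information
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "atom".equals(prefix) ? ATOM_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return ATOM_NS.equals(uri) ? "atom" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singleton(prefix).iterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE_FEED, "/feed/entry"));               // 0: unprefixed names mean "no namespace"
        System.out.println(count(SAMPLE_FEED, "/atom:feed/atom:entry"));     // 2: prefix bound to the feed namespace
        System.out.println(count(SAMPLE_FEED, "//*[local-name()='entry']")); // 2: ignores namespaces entirely
    }
}
```

If that is indeed what happened, binding a prefix (or falling back to local-name()) would be the missing "particular setting".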

The DOM4J Experience

I had used Dom4j before, just once, to build XML rather than to parse it. I had relatively good memories of that experience, so I decided to try it on my problem. At first I found the Dom4j API a bit confusing, as there are so many methods on the most-used classes. This is because Dom4j is DOM compatible. But I quickly understood the logic of it and found some very useful methods, namely Element.elements(name), which returns all child elements with a given name. Of course, there is an asXML() method, like XOM's toXML(). There is also XPath support.
I tried XPath on the Blogger RSS feed again, without success. There really must be a trick to get it to recognize RSS. But with the elements("entry") method, I very quickly got the same result with not much more code, and it worked.

So: DOM vs. XOM vs. DOM4J = 0 - 0 - 1

Example Code:

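// Fragment: needs dom4j on the classpath. "response" (the feed data, e.g. an
// InputStream) and LOG are defined elsewhere.
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

SAXReader reader = new SAXReader();
Document doc = reader.read(response);
Collection posts = new ArrayList();
// elements("entry") returns the <entry> children of the root element
List entries = doc.getRootElement().elements("entry");
if (LOG.isDebugEnabled())
{
    LOG.debug("found " + entries.size() + " entries");
}
for (int i = 0; i < entries.size(); i++)
{
    Element entry = (Element) entries.get(i);
    // one map per post: child element name -> value
    Map m = new HashMap();
    for (Iterator it = entry.elementIterator(); it.hasNext();)
    {
        Element detail = (Element) it.next();
        String name = detail.getName();
        if (name.equals("link"))
        {
            // the URL is in the href attribute; attributeValue() returns null
            // instead of throwing when the attribute is missing
            m.put("link", detail.attributeValue("href"));
        }
        else if (name.equals("content"))
        {
            // keep the markup: asXML() serializes the element and its children
            m.put("content", detail.asXML());
        }
        else
        {
            m.put(name, detail.getTextTrim());
        }
    }

    posts.add(m);
    if (LOG.isDebugEnabled())
    {
        LOG.debug("found=" + m.get("title") + ", url=" + m.get("link"));
    }
}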

1 comment:

  1. hi,
    i m going to start my work in NUX , i am writing programme but want to know how to compile and run that program using NUX, i have Installed it and added the class paths... Please help me in this context
