Monday, September 12, 2005

JavaBlogs Weekly Top 10 and Java HTML parsing

I took some time to continue my little JavaBlogs analysis, I now have a page summarizing the top 10 most read blog entries in the last week. The page is generated every 24h (this is why there is no 'best progression' as of today).

I also fixed some bugs related to HTML in RSS2. I understand a bit better why a RSS 1.0 co-author decided to remove the possibility of HTML descriptions for RSS 3.0. It often does not make sense to keep all that information about styles, fonts, etc. from different sources. What I do is rewrite the HTML, allowing only b,i,a,p,br tags, with the style information stripped. I found the open-source htmlparser java library quite helpful to achieve that.

Mai 2006 Update
I now post the top10 every week to my blog, I wrote a little piece on how to interact with Blogger API in Java. This avoids me having to maintain an extra site.
HTTP requests handling is done using commons-httpclient library to have more control over how http requests to are performed. commons-httpclient is also useful to post to blogger. About the parsing with htmlparser, I changed the way to do it, I used to only use a simple Lexer, but now I changed to using NodeVisitor as it allows me to parse with finer granularity more easily, even though it is probably slower. I needed that to update href elements to that they are XHTML compliant.

You will find concrete code using htmlparser in a more recent post, just follow the link.

Categories: , ,