Chase The Devil: Java HTML Parsing Example With htmlparser

Tuesday, May 30, 2006

Every week, I post the top 10 most-read javablogs entries on this blog. I started doing it because I don't follow what happens over the weekend, so the list picks up interesting weekend stories, and I also don't watch javablogs every day. Overall I find it a good way to stay up to date with interesting things happening on javablogs.

As mentioned in an earlier post, my library of choice for parsing the javablogs pages is htmlparser (on SourceForge), because it's free and open source, and because I am lazy and did not want to write my own. If you know a better open source library, feel free to leave a comment; I'll be glad to hear about it. htmlparser is not the easiest library to use: there are many entry points and it's not immediately clear which one to choose. So I am posting how I used it here, in case it saves a few minutes for anyone who has to do the same task.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.htmlparser.Parser;
  import org.htmlparser.Tag;
  import org.htmlparser.lexer.Lexer;
  import org.htmlparser.lexer.Page;
  import org.htmlparser.util.ParserException;
  import org.htmlparser.visitors.NodeVisitor;

  // Entry is not part of htmlparser: it is a small application class with
  // the fields title, description and count (not shown here).
  // LOGGER is a logger field of the enclosing class (not shown here).
  private static Entry parseEntry(String content) throws ParserException
  {
    final Entry entry = new Entry();

    // Rewrites the href of every <a> tag: relative links are made absolute
    // against javablogs.com, and '&' is escaped as '&amp;'.
    final NodeVisitor linkVisitor = new NodeVisitor() {

      @Override
      public void visitTag(Tag tag) {
        if (!"a".equalsIgnoreCase(tag.getTagName())) return;

        String hrefValue = tag.getAttribute("href");
        if (hrefValue == null) return;

        // Absolute links are left alone (https:// included, otherwise
        // they would be prefixed with the javablogs host by mistake).
        if (!hrefValue.startsWith("http://") && !hrefValue.startsWith("https://"))
        {
          if (!hrefValue.startsWith("/")) hrefValue = "/" + hrefValue;
          hrefValue = "http://javablogs.com" + hrefValue;
        }
        tag.setAttribute("href", hrefValue.replace("&", "&amp;"));
      }

    };

    // Collects the read count, summary and title from the span/div
    // elements that carry the matching class attribute.
    final NodeVisitor visitor = new NodeVisitor() {

      @Override
      public void visitTag(Tag tag) {
        String name = tag.getTagName();
        if (!"span".equalsIgnoreCase(name) && !"div".equalsIgnoreCase(name)) return;
        if (tag.getChildren() == null) return; // empty element, nothing to extract

        String classValue = tag.getAttribute("class");
        if ("blogentrydetails".equals(classValue))
        {
          // Require at least one digit so parseInt never sees an empty string.
          Pattern countPattern = Pattern.compile("Reads:\\s*([0-9]+)");
          Matcher matcher = countPattern.matcher(tag.getChildren().toHtml());
          if (matcher.find())
          {
            entry.count = Integer.parseInt(matcher.group(1));
          }
        }
        else if ("blogentrysummary".equals(classValue))
        {
          entry.description = childrenAsHtml(tag);
        }
        else if ("blogentrytitle".equals(classValue))
        {
          entry.title = childrenAsHtml(tag);
        }
      }

      // Fixes the links below the tag, then returns its children as HTML
      // with every run of whitespace collapsed to a single space.
      private String childrenAsHtml(Tag tag) {
        try
        {
          tag.getChildren().visitAllNodesWith(linkVisitor);
        }
        catch (ParserException pe)
        {
          LOGGER.error(pe, pe);
        }
        return tag.getChildren().toHtml().replaceAll("\\s+", " ");
      }

    };

    Parser parser = new Parser(new Lexer(new Page(content, "UTF-8")));
    parser.visitAllNodesWith(visitor);

    // An entry without a title is treated as a parse failure.
    return entry.title != null ? entry : null;
  }
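
The method above uses an Entry class that is not part of htmlparser and is not shown in this post. Here is a minimal sketch of what it could look like, assuming it is a plain holder with the three fields the method assigns (title, description, count). The sketch also pulls the two pure-string steps (reading the "Reads:" counter and normalizing an href) into standalone helpers so they can be tried without htmlparser on the classpath. The names EntrySketch, parseCount and normalizeHref, and the sample strings in main, are made up for illustration; normalizeHref also leaves https:// links untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntrySketch {
    // Same idea as the pattern in parseEntry: "Reads:" followed by digits.
    private static final Pattern COUNT_PATTERN = Pattern.compile("Reads:\\s*([0-9]+)");

    /** Returns the read count found in the HTML fragment, or 0 if there is none. */
    static int parseCount(String detailsHtml) {
        Matcher m = COUNT_PATTERN.matcher(detailsHtml);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    /** Makes a relative href absolute against javablogs.com and escapes '&'. */
    static String normalizeHref(String href) {
        if (!href.startsWith("http://") && !href.startsWith("https://")) {
            if (!href.startsWith("/")) {
                href = "/" + href;
            }
            href = "http://javablogs.com" + href;
        }
        return href.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs, not taken from a real javablogs page.
        Entry entry = new Entry();
        entry.title = "Some title";
        entry.count = parseCount("Posted today | Reads: 1234");
        System.out.println(entry.title + " / " + entry.count);
        System.out.println(normalizeHref("entry?id=1&x=2"));
    }
}

/** Hypothetical holder class; the field names are the ones parseEntry assigns. */
class Entry {
    String title;        // HTML inside the "blogentrytitle" element
    String description;  // HTML inside the "blogentrysummary" element
    int count;           // number that follows "Reads:"
}
```

Package-private fields keep the sketch compatible with the direct assignments (entry.count = ...) in parseEntry; getters and setters would work just as well if you adapt the method accordingly.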

7 comments:

Anonymous, January 05, 2007 6:01 AM
Peace, guy!
Merry Christmas and Happy New Year!
Can you help me?
I'm from Teresina-Piauí (Brazil), my name is Luís Augusto (Louis August?) and I'm just starting to use HTMLParser. I want to know how I can create an application (using this API) that reads an HTML document, gets all of its tags, and counts how often a given keyword occurs in each tag (in the text between the start tag and the end tag). I need the number of times the keyword appears in the page and its location (tag).
If you can help me, I'll be grateful.
Sorry, I'm not good with English.
God bless you!
My email: lardpi@gmail.com

Ohad Serfaty, January 16, 2007 8:13 PM
hi guys
Let me suggest using Java Html Mozilla Parser. It is a real-world parser that is based on Mozilla's parser and parses the DOM just as you would expect it to look in a browser. It's way better than any Java parser I have ever seen, although it has its own issues.
Let me know what you think; I will have a full-featured web site ready in no time...

Anonymous, February 24, 2009 9:44 PM
That was helpful. Thank you.

-Eric.

iulian, October 27, 2009 8:58 AM
Thank you for the example...
But I have a little problem.. My IDE (NetBeans 6.7.1) doesn't see any suitable class for "final Entry entry = new Entry();"
Is this a class of HTML Parser also?
Cheers

duncan, April 22, 2010 10:12 PM
Thanks, Fabien! This is very useful.

Santiago, October 12, 2010 9:11 PM
Great,

thanks for this. It was really helpful.

padmalcom, June 14, 2011 1:35 PM
Hi, thanks for the tutorial!

I wonder where to find "Entry", I tried every import but cannot find a type that matches the parameters "description", "title" and "count". Thanks for your help.