Tuesday, May 30, 2006

Java HTML Parsing Example With htmlparser

Every week I post the ten most-read javablogs entries on this blog. I started doing it because I don't follow what happens over the weekend and I don't check javablogs every day, so the list picks up interesting stories I would otherwise miss. Overall I find it a good way to stay up to date with what is happening on javablogs.

As mentioned in an earlier post, my library of choice for the parsing is htmlparser (on SourceForge), because it's free and open source, and because I am lazy and did not want to write my own. If you know a better open source library, feel free to leave a comment; I'll be glad to hear about it. htmlparser is not the easiest library to use: there are many entry points and it's not immediately clear which one to choose. So here is how I used it, in case it saves a few minutes for anyone who has to do the same task.

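  // Imports used by this method:
  //   java.util.regex.Matcher, java.util.regex.Pattern
  //   org.htmlparser.Parser, org.htmlparser.Tag
  //   org.htmlparser.lexer.Lexer, org.htmlparser.lexer.Page
  //   org.htmlparser.util.ParserException
  //   org.htmlparser.visitors.NodeVisitor
  // Entry (fields title, description, count) and LOGGER are not htmlparser classes.
  private static Entry parseEntry(String content) throws ParserException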
  {
    final Entry entry = new Entry();

    // Makes relative links absolute so they still work outside javablogs.com.
    final NodeVisitor linkVisitor = new NodeVisitor() {
      
      @Override
      public void visitTag(Tag tag) {
        String name = tag.getTagName();

        if ("a".equalsIgnoreCase(name))
            {
              String hrefValue = tag.getAttribute("href");
              if (hrefValue != null && !hrefValue.startsWith("http://"))
              {
                if (!hrefValue.startsWith("/")) hrefValue = "/"+hrefValue;
                hrefValue = "http://javablogs.com"+hrefValue;
                //System.out.println("test, value="+hrefValue);
              }
              if (hrefValue != null)
              {
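                // Assumption: the entity in this call was lost when the post was rendered; unescaping "&amp;" is a guess.
                hrefValue = hrefValue.replaceAll("&amp;", "&");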
                tag.setAttribute("href", hrefValue);                
              }
            }
      }
    
    };
    
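    // Picks the read count, summary and title out of the span/div tags, keyed by their class attribute.
    NodeVisitor visitor = new NodeVisitor() {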

      @Override
      public void visitTag(Tag tag) {        
        String name = tag.getTagName();
            if ("span".equalsIgnoreCase(name) || "div".equalsIgnoreCase(name))
            {              
              String classValue = tag.getAttribute("class");
//                LOGGER.debug("visittag name="+name+" class="+classValue+"children="+tag.getChildren().toHtml());
              if ("blogentrydetails".equals(classValue))
              {
                Pattern countPattern = Pattern.compile("Reads:\\s*([0-9]+)");
                Matcher matcher = countPattern.matcher(tag.getChildren().toHtml());
                if (matcher.find())
                {
                  // [0-9]+ guarantees at least one digit, so parseInt never sees an empty string
                  entry.count = Integer.parseInt(matcher.group(1));
                }
                
              }
              else if ("blogentrysummary".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.description = tag.getChildren().toHtml();                 
                entry.description = entry.description.replaceAll("\\s+", " ");
              }
              else if ("blogentrytitle".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.title = tag.getChildren().toHtml();
                entry.title = entry.title.replaceAll("\\s+", " ");
              }              
            }
            
      }

    };
    Parser parser = new Parser(new Lexer(new Page(content,"UTF-8")));
    parser.visitAllNodesWith(visitor);
    // Without a title, the content is not treated as an entry.
    return entry.title != null ? entry : null;
  }
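
Entry is not an htmlparser class; parseEntry only needs a small holder with title, description and count. Below is a minimal sketch of such a holder, with field names and types inferred from how the method above uses them. The EntryRules class and its two methods are hypothetical names, added only so the "Reads:" pattern and the link rule can be tried without htmlparser on the classpath; they mirror the logic above but are not the code that runs on this blog.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical holder for one javablogs entry. Fields are package-visible
 * because parseEntry() assigns them directly (entry.title = ...).
 */
class Entry {
    /** Title as an HTML fragment, whitespace collapsed. */
    String title;
    /** Summary as an HTML fragment, whitespace collapsed. */
    String description;
    /** Number parsed from the "Reads: N" text; stays 0 if none was found. */
    int count;
}

/** Hypothetical helpers that isolate two rules used inside parseEntry(). */
class EntryRules {
    private static final Pattern READS = Pattern.compile("Reads:\\s*([0-9]+)");

    /** Returns the number following "Reads:", or 0 when there is none. */
    static int readCount(String html) {
        Matcher m = READS.matcher(html);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    /** Prefixes relative hrefs with http://javablogs.com; http:// links and null pass through. */
    static String absoluteHref(String href) {
        if (href == null || href.startsWith("http://")) {
            return href;
        }
        return "http://javablogs.com" + (href.startsWith("/") ? href : "/" + href);
    }
}
```

With these in place, EntryRules.readCount("Reads: 42") gives 42 and EntryRules.absoluteHref("Jump.action") gives "http://javablogs.com/Jump.action". Note that, like the original check, absoluteHref treats anything not starting with "http://" as relative, so an "https://" or "mailto:" link would need an extra test.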

7 comments:

  1. Peace, guy!
    Merry Christmas and Happy New Year!
    Can you help me?
    I'm from Teresina-Piauí (Brazil), my name is Luís Augusto (Louis August?) and I'm just starting to use HTMLParser. I'd like to know how I can create an application (using this API) that reads an HTML document, gets all of the tags, and counts how often a given keyword occurs in each tag (in the text between the start tag and the end tag). I need the number of times the keyword appears in the page and its location (tag).
    If you can help me, I'll be grateful.
    Sorry, I'm not good with English.
    God bless you!
    My email: lardpi@gmail.com

  2. hi guys
    Let me suggest using Java Html Mozilla Parser. It is a real-world parser based on Mozilla's parser, and it parses the DOM just as you would expect it to look in a browser. It's way better than any Java parser I have ever seen, although it has its own issues.
    Let me know what you think; I will have a full-featured web site ready in no time...

  3. That was helpful. Thank you.

    -Eric.

  4. Thank you for the example...
    But I have a little problem.. My IDE (NetBeans 6.7.1) doesn't see any suitable class for "final Entry entry = new Entry();"
    Is this a class of HTML Parser also?
    Cheers

  5. Thanks, Fabien! This is very useful.

  6. Great,

    thanks for this. It was really helpful.

  7. Hi, thanks for the tutorial!

    I wonder where to find "Entry", I tried every import but cannot find a type that matches the parameters "description", "title" and "count". Thanks for your help.
