Tuesday, May 30, 2006

Java HTML Parsing Example With htmlparser

Every week, I post the top 10 most-read javablogs entries on this blog. I do it because I don't follow what happens over the weekend and I don't check javablogs every day, so the list picks up interesting stories I would otherwise miss. Overall, I find it a good way to stay up to date with what is happening on javablogs.

As mentioned in an earlier post, my library of choice for the parsing is htmlparser (on SourceForge), because it's free and open source, and because I am lazy and did not want to write my own. If you know a better open source library, feel free to leave a comment; I'll be glad to hear about it. htmlparser is not the easiest library to use: there are many entry points, and it's not immediately clear which one to choose. So I am posting how I used it, in case it saves a few minutes for anyone who has to do the same task.

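  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.htmlparser.Parser;
  import org.htmlparser.Tag;
  import org.htmlparser.lexer.Lexer;
  import org.htmlparser.lexer.Page;
  import org.htmlparser.util.ParserException;
  import org.htmlparser.visitors.NodeVisitor;

  // Parses the HTML fragment of one javablogs entry; returns null if no title is found.
  private static Entry parseEntry(String content) throws ParserException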
  {
    final Entry entry = new Entry();

    final NodeVisitor linkVisitor = new NodeVisitor() {
      
      @Override
      public void visitTag(Tag tag) {
        String name = tag.getTagName();

        if ("a".equalsIgnoreCase(name))
            {
              String hrefValue = tag.getAttribute("href");
              if (hrefValue != null && !hrefValue.startsWith("http://"))
              {
                if (!hrefValue.startsWith("/")) hrefValue = "/"+hrefValue;
                hrefValue = "http://javablogs.com"+hrefValue;
                //System.out.println("test, value="+hrefValue);
              }
              if (hrefValue != null)
              {
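                // The arguments were garbled in the published listing; decoding "&amp;" to "&" is an assumption.
                hrefValue = hrefValue.replaceAll("&amp;", "&");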
                tag.setAttribute("href", hrefValue);                
              }
            }
      }
    
    };
    
    NodeVisitor visitor = new NodeVisitor() {

      @Override
      public void visitTag(Tag tag) {        
        String name = tag.getTagName();
            if ("span".equalsIgnoreCase(name) || "div".equalsIgnoreCase(name))
            {              
              String classValue = tag.getAttribute("class");
//                LOGGER.debug("visittag name="+name+" class="+classValue+"children="+tag.getChildren().toHtml());
              if ("blogentrydetails".equals(classValue))
              {
                // Require at least one digit so an empty match cannot reach parseInt.
                Pattern countPattern = Pattern.compile("Reads:\\s*([0-9]+)");
                Matcher matcher = countPattern.matcher(tag.getChildren().toHtml());
                if (matcher.find())
                {
                  String countStr = matcher.group(1);
                  entry.count = Integer.parseInt(countStr);
                }
                
              }
              else if ("blogentrysummary".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.description = tag.getChildren().toHtml();                 
                // Collapse runs of whitespace into a single space.
                entry.description = entry.description.replaceAll("\\s+", " ");
              }
              else if ("blogentrytitle".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.title = tag.getChildren().toHtml();
                entry.title = entry.title.replaceAll("\\s+", " ");
              }              
            }
            
      }

    };
    Parser parser = new Parser(new Lexer(new Page(content,"UTF-8")));
    parser.visitAllNodesWith(visitor);
    // An entry without a title is treated as not parsed.
    return (entry.title != null) ? entry : null;
  }
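
The snippet above relies on an Entry class and a LOGGER field that are not shown. Below is a minimal sketch of what Entry could look like, assuming it is only a holder for the three fields that parseEntry() assigns (title, description, count). The EntryHelpers class, its method names and the sample strings are hypothetical; they repeat the pure-Java steps of the two visitors (the "Reads:" regex and the relative-to-absolute link rewrite) so they can be tried without htmlparser on the classpath. LOGGER is presumably a log4j-style logger, which is also an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for one javablogs entry; the field names match
// the ones parseEntry() assigns.
class Entry {
    String title;        // inner HTML of the "blogentrytitle" element
    String description;  // inner HTML of the "blogentrysummary" element
    int count;           // number taken from the "Reads: N" text

    @Override
    public String toString() {
        return count + " reads: " + title;
    }
}

// Hypothetical helpers mirroring the non-htmlparser logic of the visitors.
class EntryHelpers {
    private static final Pattern COUNT_PATTERN =
            Pattern.compile("Reads:\\s*([0-9]+)");

    // Returns the read count found in the HTML fragment, or -1 if there is none.
    static int parseReadCount(String html) {
        Matcher matcher = COUNT_PATTERN.matcher(html);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    // Turns a site-relative href into an absolute javablogs.com URL;
    // hrefs that already start with "http://" are returned unchanged.
    static String absolutize(String href) {
        if (href.startsWith("http://")) {
            return href;
        }
        if (!href.startsWith("/")) {
            href = "/" + href;
        }
        return "http://javablogs.com" + href;
    }

    public static void main(String[] args) {
        Entry entry = new Entry();
        entry.title = "Some title";
        entry.count = parseReadCount("Posted by someone | Reads: 123");
        System.out.println(entry);
        System.out.println(absolutize("some/relative/link"));
    }
}
```

Keeping Entry as a plain field holder matches how the visitor writes to entry.title, entry.description and entry.count directly; getters and setters would work just as well.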

11 comments:

  1. I like the Quiotix Parser by Brian Goetz. There seems to be maintenance up through a year ago. It parses to a DOM and provides a visitor interface. Provided visitors can rewrite the HTML as-is or pretty printed. Here's a little sample of one of my utilities (and some of my very earliest Java)

    public void scanFile( File file )
    {
        boolean modified = false;

        Logger.log( "htmhtmlTool01", "HtmhtmlTool.scanFile FileId=" + file );
        scanCount++;
        try
        {
            HtmlParser parser = new HtmlParser(
                new BufferedReader(
                    new FileReader( file )));
            HtmlDocument doc = parser.HtmlDocument();

            doc.accept( new HtmlScrubber() );
            doc.accept( new HtmlCollector() );
            if ( doDebug1 ) doc.accept( new HtmlDebugger() );

            if ( doLinkList )
            {
                linkFinder.setDocumentName( file + "" );
                doc.accept( linkFinder );
            }

            if ( doAutoToc )
            {
                doc.accept( autoToc );
                if ( autoToc.wasModified() )
                {
                    Logger.log( "HtmhtmlTool.AutoToc modified " + file );
                    modified = true;
                }
            }

            if ( doNavWizard )
            {
                navWizard.setDocumentName( file + "" );
                doc.accept( navWizard );
                if ( navWizard.wasModified() )
                {
                    Logger.log( "HtmhtmlTool.NavWizard modified " + file );
                    modified = true;
                }
            }

            if ( doDebug2 ) doc.accept( new HtmlDumper( System.out ) );

            // Write the document back only when rewriting is enabled and a visitor changed it.
            if ( rewrite && modified )
            {
                Logger.log( "HtmhtmlTool.scanFile Rewriting " + file );
                doc.accept( new HtmlDumper(
                    new BufferedOutputStream(
                        new FileOutputStream( file ) ) ) );
            }
        }
        catch ( Exception e )
        {
            System.out.println("Exception " + e );
            e.printStackTrace( );
        }
    }

  2. Peace, guy!
    Merry Christmas and Happy New Year!
    Can you help me?
    I'm from Teresina-Piauí (Brazil), my name is Luís Augusto (Louis August?) and I'm starting to use HTMLParser. I want to know how I can create an application (using this API) that reads an HTML document, gets all the tags, and counts how often a given keyword occurs in each tag (in the text between the start tag and the end tag). I need the number of times the keyword appears in the page and its location (tag).
    If you can help me, I'll be grateful.
    Sorry i'm not good with english.
    God bless you!
    My email: lardpi@gmail.com

  3. hi guys
    Let me suggest using Java Html Mozilla Parser. It is a real-world parser based on Mozilla's parser, and it parses the DOM just as you would expect it to look in a browser. It's way better than any Java parser I have ever seen, although it has its own issues.
    Let me know what you think; I will have a full-featured web site ready in no time...

  4. That was helpful. Thank you.

    -Eric.

  5. Hi Stan!
    first off, thanks for the tutorial!
    I'm trying to use the htmlparser, but I'm stuck with an issue here.
    I was wondering if it is possible to remove certain tags from a content. For instance, replace all links on a page with just their text values, etc.
    I might do it using a pattern + matcher, but would be better if I could use the htmlparser.
    Really appreciate and hope to hear from you soon!

    Kind regards,
    Ikrom (email: icky[at]inbox.ru)

  6. I am using htmlparser (htmlparser.org) to rewrite all the links in an input String.

    All i need to do is iterate over all the link tags (

    I am not sure how exactly I can update only the selected link elements in the input String, while leaving all other data in the input String untouched.

    It seems like the htmlparser library can extract certain elements for manipulation, but it can't manipulate elements in their original context and then return their updated values while maintaining the integrity of the original context.

    Any help would be greatly appreciated.

    Thanks

  7. Thank you for the example...
    But I have a little problem.. My IDE (NetBeans 6.7.1) doesn't see any suitable class for "final Entry entry = new Entry();"
    Is this a class of HTML Parser also?
    Cheers

  8. Thanks, Fabien! This is very useful.

  9. Great,

    thanks for this. It was really helpful.

  10. Hi, thanks for the tutorial!

    I wonder where to find "Entry", I tried every import but cannot find a type that matches the parameters "description", "title" and "count". Thanks for your help.

  11. Take a look at this RSS reader example; this library (ROME) is very good.

    import java.net.HttpURLConnection;
    import java.net.InetSocketAddress;
    import java.net.Proxy;
    import java.net.SocketAddress;
    import java.net.URL;
    import java.util.Iterator;
    import java.util.List;

    import com.sun.syndication.feed.synd.SyndEntry;
    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;

    public class RomeLibraryExample {
        public static void main(String[] args) throws Exception {
            //URL url = new URL("http://micubitas.wordpress.com/comments/feed/");
            //http://lamaquinadiferencial.wordpress.com/about/
            //http://micubitas.wordpress.com/feed/
            URL url = new URL("http://lamaquinadiferencial.wordpress.com/about/feed/");

            // Go through a local HTTP proxy on port 3128
            SocketAddress ipPuerto = new InetSocketAddress("localhost", 3128);
            Proxy proxlocal = new Proxy(Proxy.Type.HTTP, ipPuerto);

            HttpURLConnection httpcon = (HttpURLConnection) url.openConnection(proxlocal);
            // Reading the feed
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(httpcon));

            System.out.println("Author: " + feed.getAuthor());
            System.out.println("Copyright: " + feed.getCopyright());
            System.out.println("Description: " + feed.getDescription());
            System.out.println("Encoding: " + feed.getEncoding());
            System.out.println("FeedType: " + feed.getFeedType());
            System.out.println("Language: " + feed.getLanguage());
            System.out.println("Link: " + feed.getLink());
            System.out.println("Title: " + feed.getTitle());
            System.out.println("Uri: " + feed.getUri());
            System.out.println("========================");

            List entries = feed.getEntries();
            Iterator itEntries = entries.iterator();

            while (itEntries.hasNext()) {
                SyndEntry entry = (SyndEntry) itEntries.next();
                System.out.println("Title: " + entry.getTitle());
                System.out.println("Link: " + entry.getLink());
                System.out.println("Author: " + entry.getAuthor());
                System.out.println("Publish Date: " + entry.getPublishedDate());
                System.out.println("Description: " + entry.getDescription().getValue());
                System.out.println("========================");
                System.out.println();
            }
        }
    }
