Tuesday, May 30, 2006

Java HTML Parsing Example With htmlparser

Every week, I post the ten most-read javablogs entries on this blog. I do it because I don't follow what happens over the weekend, so the list picks up interesting weekend stories, and because I don't read javablogs every day. Overall, I find it a good way to stay up to date with interesting things happening on javablogs.

As mentioned in an earlier post, my library of choice for the parsing is htmlparser (on SourceForge), because it's free and open source, and because I am lazy and did not want to write my own. If you know a better open source library, feel free to leave a comment; I'll be glad to hear about it. htmlparser is not the easiest library to use: there are many entry points, and it's not immediately clear which one to choose. So I am posting how I used it, in case it saves a few minutes for people who have to do the same task.

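  // imports used by the method below
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;
  import org.htmlparser.Parser;
  import org.htmlparser.Tag;
  import org.htmlparser.lexer.Lexer;
  import org.htmlparser.lexer.Page;
  import org.htmlparser.util.ParserException;
  import org.htmlparser.visitors.NodeVisitor;

  private static Entry parseEntry(String content) throws ParserException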
  {
    final Entry entry = new Entry();

    final NodeVisitor linkVisitor = new NodeVisitor() {
      
      @Override
      public void visitTag(Tag tag) {
        String name = tag.getTagName();

        if ("a".equalsIgnoreCase(name))
            {
              String hrefValue = tag.getAttribute("href");
              if (hrefValue != null && !hrefValue.startsWith("http://"))
              {
                if (!hrefValue.startsWith("/")) hrefValue = "/"+hrefValue;
                hrefValue = "http://javablogs.com"+hrefValue;
                //System.out.println("test, value="+hrefValue);
              }
              if (hrefValue != null)
              {
                hrefValue = hrefValue.replaceAll("&amp;", "&");
                tag.setAttribute("href", hrefValue);                
              }
            }
      }
    
    };
    
    NodeVisitor visitor = new NodeVisitor() {

      @Override
      public void visitTag(Tag tag) {        
        String name = tag.getTagName();
            if ("span".equalsIgnoreCase(name) || "div".equalsIgnoreCase(name))
            {              
              String classValue = tag.getAttribute("class");
//                LOGGER.debug("visittag name="+name+" class="+classValue+"children="+tag.getChildren().toHtml());
              if ("blogentrydetails".equals(classValue))
              {
                // Require at least one digit so parseInt never sees an empty string.
                Pattern countPattern = Pattern.compile("Reads:\\s*([0-9]+)");
                Matcher matcher = countPattern.matcher(tag.getChildren().toHtml());
                if (matcher.find())
                {
                  String countStr = matcher.group(1);
                  entry.count = Integer.parseInt(countStr);
                }
                
              }
              else if ("blogentrysummary".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.description = tag.getChildren().toHtml();                 
                entry.description = entry.description.replaceAll("\\s+", " ");
              }
              else if ("blogentrytitle".equals(classValue))
              {
                try
                {
                  tag.getChildren().visitAllNodesWith(linkVisitor);
                }
                catch (ParserException pe)
                {
                  LOGGER.error(pe,pe);
                }
                entry.title = tag.getChildren().toHtml();
                entry.title = entry.title.replaceAll("\\s+", " ");
              }              
            }
            
      }

    };
    Parser parser = new Parser(new Lexer(new Page(content,"UTF-8")));
    parser.visitAllNodesWith(visitor);
    // Return null when no title was found.
    return entry.title != null ? entry : null;
  }
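
The two string operations buried in the visitors above (pulling the number out of "Reads: N" and turning a relative href into an absolute one) can be tried on their own, without htmlparser. Here is a minimal sketch using only the JDK. The class and method names are made up for illustration, and the extra https check is an addition that is not in the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; these names are illustrative and not part of htmlparser.
class EntryTextHelpers {

    // Same idea as the pattern in parseEntry, but requiring at least one digit.
    private static final Pattern COUNT_PATTERN = Pattern.compile("Reads:\\s*([0-9]+)");

    /** Returns the read count found in the HTML fragment, or -1 if there is none. */
    static int extractReadCount(String html) {
        Matcher matcher = COUNT_PATTERN.matcher(html);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    /** Makes a site-relative href absolute and decodes the "&amp;" entity. */
    static String absolutize(String href, String baseUrl) {
        if (href == null) {
            return null;
        }
        // Assumption: https links are also left alone (the original only checks http).
        if (!href.startsWith("http://") && !href.startsWith("https://")) {
            if (!href.startsWith("/")) {
                href = "/" + href;
            }
            href = baseUrl + href;
        }
        return href.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(extractReadCount("Posted by someone. Reads: 272"));
        System.out.println(absolutize("entry.action?id=1&amp;x=2", "http://javablogs.com"));
    }
}
```

Keeping these two steps as small static methods makes them easy to check before wiring them into the visitors.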

Top 10 Most Read Last Week On Javablogs.com, Week 21


Most read last week

  1. Spring vs JBoss, and why I don’t care about Sun standards (272): After a long time, it was interesting to see the Spring and JBoss folks engage in a public war of words, in comments on Matt Raible’s blog. [read]

  2. Kent Beck: "We thought we were just programming on an airplane" (231): JUnit co-creator Kent Beck says a number of things convinced he and Erich Gamma to create a new revision of JUnit after a long hiatus, including TestNG and Java 5. Last week at JavaOne, [read]

  3. Where are you, Project Manager with Technical Skills? (204): In Spain we are facing again a lack of workers with experience in development of not-so-cutting-edge technologies like J2EE. So, [read]

  4. Thanks... and good luck Bruce! (203): It is unfortunate that Bruce Tate forgot to enable comments to his final blog entry. It would be a shame to see him off without at least a small well-wishing. (possibly a little roast too ;-) [read]

  5. Google Web Toolkit Angst (202): I've been using Google Web Toolkit for the last week or so. I'm really liking it, it is really productive and once you getting it working everything is sweet. The problem is, [read]

  6. Is this simpler than Hibernate? (193): In an earlier blog entry I described an early cut of DynaModel, Slingshot's persistence engine. [read]

  7. Article: Don't repeat the DAO! : Build a generic typesafe DAO with Hibernate and Spring AOP (192): Don't repeat the DAO! : Build a generic typesafe DAO with Hibernate and Spring AOP is a developerWorks article by Per Mellqvist which presents a generic DAO implementation class based on Hibernate, [read]

  8. Why ORM Tools are Not Recommended (185): Sandeep Sha has written an a forum posting by Why ORM Tools are Not Recommended that has some interesting points. Although I do not agree with all the points, [read]

  9. The Dojo Toolkit in Practice (185): We have posted a new article on using the Dojo Toolkit in a project. The article discusses a piece of a project that uses Ajax to create a responsive itinerary viewer. [read]


Most read last week-end

  1. Spring vs JBoss, and why I don’t care about Sun standards (272): After a long time, it was interesting to see the Spring and JBoss folks engage in a public war of words, in comments on Matt Raible’s blog. [read]

  2. Thanks... and good luck Bruce! (203): It is unfortunate that Bruce Tate forgot to enable comments to his final blog entry. It would be a shame to see him off without at least a small well-wishing. (possibly a little roast too ;-) [read]

  3. Is this simpler than Hibernate? (193): In an earlier blog entry I described an early cut of DynaModel, Slingshot's persistence engine. [read]

  4. What’s Up With Huge Resumes? (150): What’s up with huge resumes these days? The company I work for has been hiring lately and so I usually end up interviewing one to two people a week. [read]

  5. Introducing jvm-languages.com (147): Back in September of 2004, I tried to write a book. It would have been called Dynamic Languages and Java. Unfortunately, I never completed it. [read]

  6. Comparison Between PMD vs Findbugs vs Hammurapi (135): Take a look at this one the differences between these three tools Differences [read]

  7. Then God said let there be Ubuntu... ahem (130): Finally I got a version of Linux, which works as good as XP or even better ;) ; using which I can get to do my work seamlessly. Its none other than Ubuntu Dapper. [read]

  8. Job Trend, Not Google Trend (121): Wanna know the amount of Java jobs versus .Net jobs, or the growth of AJAX jobs? Google Trend may be able to help you a bit, but the result is not scoped for jobs only. Indeed. [read]

  9. 1-Minute Quiz: Why is Hyphen Illegal in Identifier? (110): Why is hyphen (-) an illegal char in Java identifier? Why can't we use variable names like first-name, as we do in xml files? The answer to this question is not hard, but the challenge is, [read]


Monday, May 22, 2006

Top 10 Most Read Last Week On Javablogs.com, Week 20


Most read last week

  1. The Worst Java Job Interview Questions. (269): Why are you looking for a job? Strictly speaking, this is not a java question, but it shows up in almost every job interview I've been to. [read]

  2. Goodbye Ant , Welcome Maven 2 (219): After years of using Ant for building my applications, I have moved to something different, Apache Maven 2. And now it seems there is no looking back. [read]

  3. Google Web Toolkit: A Brief Review (219): Google has released GWT - a java window toolkit which converts your java applications (using the toolkit API) to javascript (incl. AJAX) and HTML. [read]

  4. A *bold* paper against Threads (214): Edward A. Lee wrote a paper called "The Problem with Threads", you can find his pdf paper here. There is no rant here but facts, and sound reasoning. [read]

  5. Outsourcing your code is so cheap ... but why are so many jobs coming back from their indian trip ? (202): There are websites where you can get very cheap developpers, here are the one I know: http://www.getacoder.com/ http://www.rentacoder.com/ http://www.getafreelancer.com/ http://www. [read]

  6. Signs You're a Crappy Programmer (and don't know it) (190): Please read this great post from Damien Katz, and watch the signs Java is all you'll ever need. "Enterprisey" isn't a punchline to you. [read]

  7. Google Web Toolkit: Web Applications Just Got Harder (182): Oh the buzz. Oh the excitement. Oh the AJaX Gods has released their secret sauce with an Apache license. Google Web Toolkit allows one to develop AJaX web applications entirely in Java, [read]

  8. PDFs available for JavaOne 2006 Sessions (177): Check out the JavaOne 2006 Conference Session Catalog: “Presentation files available for download are indicated with a paperclip icon. After clicking on a paperclip, [read]

  9. Google Web Toolkit for building AJAX apps in Java (173): Google has introduced a toolkit for building AJAX applications in Java, though its in beta. It has also supplied some sample applications with the kit. [read]


Most read last week-end

  1. PDFs available for JavaOne 2006 Sessions (177): Check out the JavaOne 2006 Conference Session Catalog: “Presentation files available for download are indicated with a paperclip icon. After clicking on a paperclip, [read]

  2. Cringely: Why IBM Is in Trouble (159): Robert X. Cringley doesnt have a high opinion of IBM. Last week, he wrote, ...what is IBM? IBM is a disaster-in-the-making. [read]

  3. JavaOne Gossip: NetBeans Pulls a Prank on Eclipse (147): Humor makes life fun. Life just got a lot funnier. For some I guess. netBeans - Eclipse 1-0. Post your suggestions on how Eclipse should get even. [read]

  4. Day 5: McNealy, Gosling, Gage: "Forget the box" (139): With a mixture of sadness, relief, and hope for the future, former Sun CEO Scott McNealy took the stage this morning at the final keynote address of JavaOne 2006. [read]

  5. Project Harmony gets AWT/Swing Contrib from Intel (127): This may be a bit late but at JavaONE this year JEdit was shown running on the AWT/Swing contribution that Intel gave to Project Harmony. [read]

  6. Java 7.0 (Dolphin): Evolving in the Ecosystem (121): Sun developer Danny Coward says "Compatibility is king", but Sun is not staying still in the Java space. [read]

  7. This is genuine Microsoft (120): I started playing with Google Web Toolkit beta- actually I didn’t really start. Because I had to uninstall IE7 (which I don’t use at all), but hey I’d been curious. [read]

  8. Become a Java Champion, stay in useless Country, Learning Java for what? (120): I just thinking, what should we learn Java? Why dont use dotNet, I read Matt blog about his income US$ 200k more, or Mike Conan in OZ, that become the good best company. Today I just dont know, [read]

  9. jBixbe: a java tool I consider ... buying ! (90): I found this tool on Erik's linkblog, thanks to him ! [read]


Tuesday, May 16, 2006

Top 10 Most Read Last Week On Javablogs.com, Week 19


Most read last week

  1. Axis2: Why bother? (257): The Axis team is kicking up a big fuss about their recent release of Axis 2 (1.0!) Surprisingly, this library is so so abysmally bad, [read]

  2. Google trends proves: Java is doomed (251): Google trends is a nice idea, and I had to apply it adhoc to Java, Ruby, Python and C#. Interesting results, I can see a decline in Java! [read]

  3. Rich Open Source Webmail that doesn't suck (219): Guys...lets face it. Squirrel Mail... So check out our killer rich webmail. [read]

  4. Your Next Programming Language (216): Many people talk about how, as software developers, we should learn new programming languages frequently. [read]

  5. How to recognize a "Sacred Code" (210): You know you are dealing with a "sacred code" when you ask a previous developer (or the designer of the code) a question about the code and his immediate reply is .... [read]

  6. All you ever wanted to know about Workflow and how it relates to Java, Transactions and Concurrency (204): Read this blog carefully and you're in for a PAYRAISE. Workflow and business process technology will be essential in developing next generation applications. The knowledge about it is scarce. [read]

  7. Omg - I love this (Mac users may not) (203): This guy doesn't like Macs Damnation this is funny.... [read]

  8. 7 Reasons Why Web Apps Fail (179): Web applications are popping up faster and faster every day, and quite a few are using the power that Ajax offers to their advantage. [read]

  9. Scaling out 37 Signal-style applications is convenient (179): I had someone telling me that: Ruby can scale. Basecamp prooves that. Now, you all know that I do not think that Ruby has ANY problems with scaling. However, [read]


Most read last week-end

  1. Omg - I love this (Mac users may not) (203): This guy doesn't like Macs Damnation this is funny.... [read]

  2. How to Design a Good API (176): I was reading this presentation on the Design of API's by Joshua Bloch it talk's about how to design a good api but more importantly the reasons why doing certain things results in a good design. [read]

  3. JRuby on Rails Is Born (172): JavaOne attendees are in for a treat. Not only will they be receiving a DDJ issue which calls Rails a tipping-point to a new era in enterprise computing (or something like that)... [read]

  4. JavaOne day -1 : Bird Strike (147): The plan was to fly out from Sydney to San Francisco today. The plane was fueled, the travelers boarded. The aircraft taxied out to the runway, takeoff speed was reached, [read]

  5. 10 things i love about my Mac (125): A switcher Top 10 of nice things on Mac OS X: 1. The way programs live in the system (no registry shit) 2. The shell 3. Firewire boot capabilities 4. Apps like iChat, iSync and Addressbook 5. [read]

  6. YouTube bandwith usage/costs ... AMAZING ! (121): While looking for successfull video hosting I found this techcrunch article about youtube called Did YouTube Just Raise another $25 million? [read]

  7. 10 things i hate about my Mac (113): A switcher Top 10 of ugly issues with Apple Mac OS X: 1. No @ key in boot camp windows installation available 2. All banking programs on mac really suck 3. adv. [read]

  8. Commons Collections 3.2 Released (113): Commons Collections 3.2 has been released. Commons Collections is a library that builds upon the Java Collection Framework. It provides additional Map, [read]

  9. GoogleTrends : Java vs C# vs PHP (112): Comparison is a poison. That said, comparing the "interest" in Java, C# and PHP with GoogleTrends, Google's new service, was very tempting... [read]


Tuesday, May 09, 2006

Last week Javablogs.com top 10


Most read last week

  1. $19 for a more productive environment (264): I just made a great improvement to my working environment in my home office. [read]

  2. Top 10 Selling Java Books (238): What are the top selling Java Books this year, what are the top selling Java books the last month. I was wondering this whilst nursing a hangover and wondering what to blog about this afternoon. [read]

  3. Top 10 Java Features (or What Makes Java Great) (227): Here is a list of top 10 java features I constantly use and highly recommend; features which makes Java great as a language and platform. [read]

  4. Martin Fowler on continous integration (224): Martin Fowler has rewritten his famous article on continous integration. Over here! [read]

  5. New Speed trick for Eclipse (223): You might have not played with the settings ye ton Eclipse 3.2RC2. Evidently, [read]

  6. Ruby on Rails in the Enterprise (218): I started using Ruby on Rails a few months ago and I can‘t remember I have ever been so happy while coding in past. [read]

  7. How to create an app in one day (216): Sometimes people do irrational and weird things and I am not an exception. Yesterday I figured that I need to relax a bit and create something ... [read]

  8. Most People Are Below Average (184): The almighty Seth Godin says: "In every category, in every profession, half the people are below average.". Bzzt! Wrong! Na-ah. Half the people in any industry are below the median. As in, [read]

  9. We're hiring! (183): ...because we're losing one of our best people. Matt has a rare combination of mad technical skills, graphic design talent, [read]


Most read last week-end

  1. Top 10 Selling Java Books (238): What are the top selling Java Books this year, what are the top selling Java books the last month. I was wondering this whilst nursing a hangover and wondering what to blog about this afternoon. [read]

  2. Eclipse 3.2 RC3 Released (171): Right on the heels of Eclipse 3.2 RC2 (last week), Eclipse 3.2 RC3 has been released. Won't be long now before the platform is locked down for final. Check it out! 36 words. [read]

  3. Guide lines to becoming a solid developer. (164): It doesn't take much too rapidly become a solid developer. First let's define what a 'solid developer' means. [read]

  4. Sun ready to join Eclipse. Pigs fly, news at 10. (158): Ed Burnette recently blogged in a 2 part series ( here and here ) about Sun considering joining Eclipse if two impossible criteria were met. Let's not hold out breath. 444 words. [read]

  5. Sun and Eclipse part 2: Getting past SWT vs. Swing (149): In this installment I continue my conversation with Sun Microsystem's director for Java Tools, Tim Cramer. Can Eclipse and Sun get past the whole SWT vs. [read]

  6. Apache Rampart 1.0 released (126): [read]

  7. Sun and Eclipse part 3: Who's eclipsing whom? (125): This is the last of a three-part series on the possibilities of Sun joining the Eclipse Foundation. In this part I'll discuss the two remaining "conditions" that Sun had for joining, [read]

  8. JTrac 2.0-EA1 released (120): JTrac is an open source issue tracker based on Spring, Spring WebFlow, Acegi and Hibernate. [read]

  9. Stop Whining...Dilbert style :) (114): [read]


Thursday, May 04, 2006

First Steps With EhCache

If you need to cache objects in your system, Ehcache is a simple cache written in Java that is widely used and well tested. Here is a short tutorial on using Ehcache, for people who don't want to dig through the documentation at first but just want to check that it works in their project and see how easy it is to set up.
Installation
Download Ehcache from the Download link on http://ehcache.sourceforge.net. The current release is 1.2.
Unpack Ehcache with an unpacker that knows the tgz format. For Unix users this is trivial; for Windows users, 7zip is a free (and open source) unpacker. It is probably the most popular, but there are others, such as tugzip, izarc or winrar.
In your Java project you need ehcache-1.2.jar, commons-collections-2.1.1.jar and commons-logging-1.0.4.jar (version numbers may vary) in your classpath. These libraries are shipped with Ehcache.
Cache Configuration
Write an ehcache.xml file in which you describe the caches you want to use. There can be several files per project and several cache descriptions per file. I use a disk-persistent cache here. The configuration file is well described at http://ehcache.sourceforge.net/documentation/configuration.html
<ehcache>
<cache name="firstcache" maxElementsInMemory="10000" eternal="false" overflowToDisk="true" timeToIdleSeconds="0" timeToLiveSeconds="0" diskPersistent="true" diskExpiryThreadIntervalSeconds="120"/>

</ehcache>

Code
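// imports needed by the code below
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

private static CacheManager cacheManager;
private static Cache cache;

static {
//Create a CacheManager using a specific config file
cacheManager = CacheManager.create(TestClass.class.getResource("/config/ehcache.xml"));
cache = cacheManager.getCache("firstcache");
}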

/**
* retrieves value from cache if exists
* if not create it and add it to cache */
public String doit(String key, String value) {
//get an element from cache by key
Element e = cache.get(key);
if (e != null) {
value = (String)e.getValue();
LOGGER.info("retrieved "+ value+" from cache ");
}
else {
value = "new value" ;
cache.put(new Element(key, value));
}
return value;
}

/**
* refresh value for given key */
public void refresh(String key) { cache.remove(key); }

/**
* to call eventually when your application is exiting */
public void shutdown() { cacheManager.shutdown(); }
Conclusion
Using EhCache is as simple as using a Java Map with an additional configuration file.
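
To make the Map comparison concrete, here is the same get-or-create pattern written against a plain java.util.Map. This is only a sketch of the analogy, not Ehcache itself: the class name is made up, and a HashMap gives you none of the expiry, memory limits or disk persistence that the configuration file controls.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration; it mirrors the doit/refresh methods above
// but stores values in a plain HashMap instead of an Ehcache Cache.
class MapCacheSketch {

    private final Map<String, String> cache = new HashMap<String, String>();

    /** Returns the cached value for the key, creating and storing it on a miss. */
    String doit(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = "new value for " + key;
            cache.put(key, value);
        }
        return value;
    }

    /** Drops the entry so that the next doit call recreates it. */
    void refresh(String key) {
        cache.remove(key);
    }

    /** Number of entries currently held. */
    int size() {
        return cache.size();
    }
}
```

With Ehcache, cache.get, cache.put and cache.remove play the roles of the Map methods here, with the value wrapped in an Element.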

Tuesday, May 02, 2006

Last week Javablogs.com top 10


Most read last week

  1. Tomcat killer ? Not exactly, but ... (322): ... I am impressed. Winstone is a micro servlet container with a distribution size of 160 KB ! And it's also interesting to note that it does not rely on any configuration file. [read]

  2. yet another jboss innovation :-) (299): http://www.jboss.com/products/jbossmc a bit pathetic, isn’t it? [read]

  3. Google seems like a bad career move to me (246): It is fair to say that if you are doing anything cool right now and people know about it that Google will try and recruit you. However, let me ask you this... [read]

  4. 10 Things I've Learnt About Working With Developers (243): I've been a technical writer for 10 years. In those 10 years I've learnt at least 10 things (which means one thing per year, yay) about working with developers. Here they are in no particular order. [read]

  5. Three Languages For Java Programmers (222): Dave Thomas, among others, has been saying for years that you can become a better Java programmer by getting out there and learning some other programming languages. [read]

  6. The next Azureus? (218): Cross your fingers and do the jiggly dance, missy, cuz there’s a new boy in town, and he’s cute, eye-wateringly colorful, and full of Java, Java, Java. [read]

  7. Free Books: Subversion, Jakarta Commons and more (214): I love good books. Good books make the difference between searching google for hours and having the answer right there. Unfortunately, I can't have all the books I want, [read]

  8. There's a lot to love about Eclipse Web Tools Platform...(with screenshots) (203): If you haven't played with Eclipse WTP and you do web development... then today is your day to start. I've been using it for about a month now and I've gotta say I am totally impressed with it. [read]

  9. What's the difference between Java 5 and Java 1.5 (195): We were going to ask the question in the title to people we were interviewing for a job. [read]


Most read last week-end

  1. why he hates tomcat (188): at the bileblog: Why I hate tomcat Of course, I’m sure the tomcat fanboys would quickly whip out their collective p3n15 and wave it about angrily. After all, it’s still popular, right? [read]

  2. The Best Feature Of The Upcoming NetBeans IDE 5.5 (Part 1) (181): NetBeans IDE 5.5 will totally redefine the word "productivity". I mean, forget wimpy little bits of code, pretty little samples, and obsequious hints and suggestions. Think big. How big? Well, [read]

  3. Quit your Day Job (178): It’s been a while since the last post, but a lot has happened for me since then. I’ve taken the plunge and quit my day job. I spent some time thinking down in Australia, [read]

  4. At Last! The Super-Important Bug 5034036 is Fixed! (143): I have been waiting for this one to get fixed. Luckily Sun is putting their effort where it is most needed. It only took two years, which is quite good for a bug this complicated. [read]

  5. Interfaces vs abstract classes (131): I recently read an article titled interface vs abstract class and imagined an Alien versus predator type block buster movie, [read]

  6. Eclipse 3.2RC2 Released (122): Eclipse 3.2RC2 has been released on the world. Build notes for this release are available here . 39 words. [read]

  7. Grail [was: Groovy On Rails] (113): Yesterday I made my first steps with Grails, actually I wanted to get something started with RoR, but I decided otherwise. [read]

  8. Chuck Norris News and Apology (107): Dear readers, first let me apologize for any part I had to play in the series of unfortunate events which have taken place since my posting of java.lang.ChuckNorris.java. [read]

  9. JavaMail goes Open Source (100): JavaMail is now open source as part of the GlassFish project. Can we get those JARs on ibiblio now and make Maven more usable? April 19, 2006 JavaMail is now open source! [read]