Wednesday, April 14, 2010

Thoughts on Fowler's Continuous Integration

It's always kind of nice to go back to the basics. I've always enjoyed re-reading basic programming practices and patterns, since I tend to forget the things I don't use on a daily basis. That's why I enjoyed reading Martin Fowler's article on Continuous Integration. The article says the last significant update occurred in May 2006, but it has withstood the test of time, much like The Cathedral and the Bazaar. If you don't have the time to read this rather long article, here are a few favorite points I pulled out as I read it over the course of a few days. Before that, let me explain a little bit about my experience.

At my first programming job we didn't really have a VCS (Version Control System) like CVS or SVN, nor did we have a CI (Continuous Integration) server; we really didn't know any better. We essentially did all of our work straight off a shared drive (I know). But that was before I came to Gestalt, now Accenture, 5 years ago. Since then I've been exposed to CVS-->SVN-->Git, Ant-->Maven 1-->Maven 2, CruiseControl-->Hudson, and finally TDD (Test Driven Development). Being exposed to all of this has been a huge improvement to my career. More importantly, it's been a huge benefit to how I write software and to the tools our teams use, such as VCS and CI. I can't imagine developing software without them.

Here are some points out of Continuous Integration that I think apply to any project, Java or not:

Work does not stop on your commit

"However my commit doesn't finish my work. At this point we build again, but this time on an integration machine based on the mainline code. Only when this build succeeds can we say that my changes are done."
So true. Just because you ran some tests locally or manually tested your changes doesn't mean you're done once you've checked them in. You've got to monitor CI to ensure it passes. This has been a topic of discussion on my team lately, as we've come in in the morning to a few broken builds. Solution: check in often during the day, but don't check in, leave, and never verify that CI passed. Either stay late, sign in from home, come in early, or check in first thing the next day.

Simple checkout build rule
"The basic rule of thumb is that you should be able to walk up to the project with a virgin machine, do a checkout, and be able to fully build the system."
This is a very important point. Not only will this improve the productivity of new team members, it will also reduce the amount of time it takes to create new CI jobs. This rule is even more important for open source projects. I've had several issues in the past trying to patch open source projects and wasted several hours just trying to build their code. If you want people to contribute to your project, make it easy for them to build your software. For example, I've been wanting to write a simple Docky plugin for Hudson, but have run into several issues (New Plugin and Missing Package) trying to build the Do project. Have those questions really been answered? NO! What have I done about it? I haven't retried it since. To restate Mr. Fowler: I should be able to easily check out your code and at a minimum build it. As an added bonus it'd be nice to be able to run the unit tests as well.

Automate everything
"However like most tasks in this part of software development it can be automated - and as a result should be automated. Asking people to type in strange commands or clicking through dialog boxes is a waste of time and a breeding ground for mistakes."
If you're just getting started with CI this can often be difficult, but your long term goal should be to automate everything. This includes creating/destroying your database, deploying/undeploying your application, automating your tests, and copying configuration files around. I'd even go as far as to say automate the creation of the development environment: installing Maven and Java, for example. Again, this not only speeds up new team members' productivity but also helps with those virgin CI servers.

Two great examples of this come to mind. Before we had an internal CI team, our team was manually setting up multiple CI servers with Maven, Java, JBoss, and a database. These new servers couldn't be used until all of this stuff was manually configured. Then our internal CI team helped automate some of it, and now we can very easily use Hudson to point jobs at different servers within minutes, something that wasn't really possible before without manual intervention. And all they really did was call a few simple Ant copy commands from Maven.

Another good example of this comes from our old CruiseControl and Ant days. At one point in our project we were constantly breaking a major piece of functionality, and one of the main reasons was that it was very difficult to test. It was a distributed test with multiple servers communicating with multiple clients via SIP. The build process called for building the latest code, stopping 2 instances of WebLogic (1 local, 1 remote), starting WebLogic, deploying the latest code, waiting for WebLogic to finish starting (not easy, mind you), and then running our automated test. This was a rather huge undertaking, but given a few weeks we had the core of it automated. It was amazing. I never thought it would have been possible, but it was, and anytime that test failed we knew immediately we had broken something. We were able to accomplish the difficult parts by calling remote bash scripts via ssh from Ant.

Imperfect Tests
"Imperfect tests, run frequently, are much better than perfect tests that are never written at all. "
I'm not exactly sure what he means by imperfect tests, but this is one place where I currently disagree. It takes practice to write good tests. Once you refactor and maintain tests over a long period of time you start getting pretty good at writing tests that require less refactoring. One of the things killing the productivity of our team right now is what I call "chronically failing tests": tests that randomly fail for no reason. You check the change log and nothing changed in the build, which means it shouldn't have failed. You rebuild the job and it passes. Lately this can be attributed to date-comparison asserts and issues with timing. For example, a test passes when the database is local but fails when the database is remote, or you get different results when the clock on the database server isn't synced. The end result is false negatives that really hurt the validity of CI; developers just start ignoring all failures. Once you've identified one of these chronically failing tests, it's important that the author of that test, or the person who last modified it, refactor the test to be more flexible. If the author doesn't do it, they will continue producing these types of imperfect tests.
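To show the kind of chronically failing test I mean, here is a minimal JUnit sketch. The class, DAO, and threshold are made up for illustration, not pulled from our code: the first test compares timestamps exactly and breaks whenever the application server and database clocks drift, while the second allows a small tolerance.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.util.Date;

import org.junit.Test;

public class AuditTimestampTest {

    // Hypothetical DAO standing in for code that stamps a record using the database clock.
    private final AuditDao auditDao = new AuditDao();

    @Test
    public void brittleExactComparison() {
        Date before = new Date();                        // application server clock
        Date stored = auditDao.saveAndReturnTimestamp(); // database server clock

        // Chronically failing: passes with a local database, fails when the
        // remote database clock is even a few seconds out of sync.
        assertEquals(before, stored);
    }

    @Test
    public void tolerantComparison() {
        Date before = new Date();
        Date stored = auditDao.saveAndReturnTimestamp();

        // More flexible: only require the two timestamps to be within 30 seconds.
        long driftMillis = Math.abs(stored.getTime() - before.getTime());
        assertTrue("timestamps differ by " + driftMillis + " ms", driftMillis < 30 * 1000);
    }

    // Stand-in for the real data access code so the sketch is self-contained.
    private static class AuditDao {
        Date saveAndReturnTimestamp() {
            return new Date();
        }
    }
}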

Good Build Characteristics
He had several comments I would wrap into good general build characteristics, two of which are fast builds and accessible artifacts. As a general rule he suggests keeping build times to around 10 minutes. That's usually achievable for compile/unit-test jobs, but database-related builds and anything beyond that can take longer. My general guideline is to try to keep those longer-running builds to around 30 minutes, and definitely no longer than an hour. Unfortunately, right now we have several of those 40-55 minute builds that I'd like to trim down some. It'd be great to see a Hudson plugin that could show me how long each part of my build took.

With a combination of our company Maven repository and Hudson, it's pretty easy to make our artifacts accessible. This is really huge, as sometimes I don't waste time building certain things that take forever to build; I'll just download them from Hudson. A lot of times our DBA will just download the zip he wants to test, which saves him from updating his source and building, etc. On a related note, we have several nightly jobs that deploy the latest code to JBoss/WebSphere so everyone can see/test/verify the latest code the next day.

Rollback Deployment
"If you deploy into production one extra automated capability you should consider is automated rollback."
This was a pretty new concept for me and one we don't necessarily follow. I've heard of Continuous Deployment, but never really heard about a rollback feature. I know we've accidentally benefited from a build failing and not deploying the latest nightly code, which allowed us to perform diff-debugging to track down a bug. We had 2 servers that built the night before: 1 passed and the other failed, so it still contained the previous day's build. A bug was detected on the passing server and we were unable to reproduce it on the outdated server, which told us it had been introduced in the past 24 hours. This isn't exactly rolling back, but maybe the moral of the story is to keep a server around that is a day behind.

Summary
There is a lot of good general information in this article and I would encourage anyone to take the time to read it. I only highlighted the things that really stuck out to me; there were a lot more useful points I passed over.

Thursday, April 8, 2010

Possible solution for WebSphere issue NMSV0310E

It's been quite a ride porting our EAR from JBoss 4.2.1 to WebSphere 6.1. Installation issues on CentOS and Precompiling JSPs were just a few of the issues we encountered. We are still producing 2 different EARs, but both are based on identical WARs.

The latest issue had to do with a WebSphere exception thrown from an init servlet that started a separate Thread. This Thread contained code that indirectly performed a JNDI lookup to get a DataSource. Unlike JBoss, WebSphere apparently doesn't like unmanaged Threads performing JNDI lookups.
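The effect was roughly equivalent to the following sketch (the servlet and class names are illustrative, not our actual code; in our case the lookup happened indirectly through Spring rather than a direct InitialContext call): a startup servlet spawns a plain java.lang.Thread, and that thread performs a "java:" JNDI lookup. JBoss 4.x tolerates this; WebSphere 6.1 cannot associate the thread with a J2EE component and fails with NMSV0310E.

import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.sql.DataSource;

// Illustrative only: a startup servlet that kicks off background work on an unmanaged Thread.
public class InitServlet extends HttpServlet {

    @Override
    public void init() throws ServletException {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try {
                    // This lookup runs on an unmanaged thread. WebSphere cannot
                    // associate it with a J2EE component and throws NMSV0310E;
                    // JBoss happily resolves it.
                    InitialContext ctx = new InitialContext();
                    DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/MY-DS");
                    // ... use the DataSource ...
                } catch (NamingException e) {
                    throw new RuntimeException(e);
                }
            }
        });
        worker.start();
    }
}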

Here is the WebSphere exception:

"javaURLContex E NMSV0310E: A JNDI operation on a "java:" name cannot be completed because the server runtime is not able to associate the operation's thread with any J2EE application component. This condition can occur when the JNDI client using the "java:" name is not executed on the thread of a server application request. Make sure that a J2EE application does not execute JNDI operations on "java:" names within static code blocks or in threads created by that J2EE application. Such code does not necessarily run on the thread of a server application request and therefore is not supported by JNDI operations on "java:" names"

This seems to be a rather common issue in WebSphere, with multiple possible solutions. Ideally, I guess you should try to configure a container-managed Thread using CommonJ (see the section Scheduling and thread pooling). Unfortunately, the configuration is different for JBoss.
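For completeness, the container-managed approach would look roughly like this CommonJ sketch. I haven't actually wired this up (we went with the fix described below instead), and it assumes a WebSphere work manager bound at java:comp/env/wm/default via a matching resource-ref; the class name is made up for illustration.

import javax.naming.InitialContext;

import commonj.work.Work;
import commonj.work.WorkManager;

// Sketch: let WebSphere's work manager run the background task on a managed
// thread, so "java:" JNDI lookups are allowed inside it.
public class DataSourceWarmup {

    public void scheduleWarmup() throws Exception {
        InitialContext ctx = new InitialContext();
        WorkManager workManager = (WorkManager) ctx.lookup("java:comp/env/wm/default");

        workManager.schedule(new Work() {
            public void run() {
                // Runs on a container-managed thread; looking up
                // java:comp/env/jdbc/MY-DS is legal here.
            }

            public void release() {
                // Called by the container to ask the work to stop.
            }

            public boolean isDaemon() {
                return false; // a short-lived unit of work
            }
        });
    }
}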

Fortunately, I think I stumbled upon another solution. While reviewing the WAR's Spring applicationContext.xml file that configured our DataSource, I noticed a property called lookupOnStartup. It was set to false; when I set it to true, the exception went away.

<bean id="dataSource" class="org.springframework.jndi.JndiObjectFactoryBean">
    <property name="jndiName" value="java:comp/env/jdbc/MY-DS"/>
    <property name="lookupOnStartup" value="true"/>
    <property name="cache" value="true"/>
    <property name="proxyInterface" value="javax.sql.DataSource"/>
</bean>
When setting it back to false the exception appeared again; setting it back to true, the exception was gone. Unfortunately, I can only speculate as to why this solved the issue. My guess is that when lookupOnStartup was false, the first attempt to get the DataSource came from the separate Thread, which WebSphere didn't like. When lookupOnStartup was true, the DataSource was first retrieved on a container-managed Thread during startup, so by the time my separate Thread needed it, it had already been looked up and cached. According to the javadocs for JndiObjectFactoryBean.setLookupOnStartup the default is true, so it can't be that bad, right? I can't think of a reason why someone would want this lookup delayed.

If you ever run into this issue, you might consider setting lookupOnStartup to true and seeing if that fixes the problem.