Aug 09 2007

I was reading the latest newsletter from the guys over at JavaLobby, and Matt Schmidt mentioned something about restarts. Here’s his comment:

Sometimes It’s Ok To Restart Your JVM
Now, I’m sure I’ll catch a flack for saying this, but I’m definitely not the first. Sometimes, it is ok to just restart your JVM. Now, I’m not just talking about restarting it because you’ve deployed some new code, no, I’m talking about just restarting it for good measure. Maybe you’re restarting it when it reaches a certain error condition or even a certain amount of memory. Some of us value our sleep at night, and when things start to go awry with software that you didn’t write and you can’t seem to fix it, we start to think about solutions that we don’t normally speak of.

It’s these solutions that many of you will scoff at, but sometimes a simple little monitoring hack can save a lot of headaches. These hacks can re-introduce a modicum of stability in a system that was previously not stable and can return some sanity to your developers who do occasionally need to sleep. So, the moral of the story is that you don’t always need to have the super clean solution; sometimes a hack works just as well. But remember, you have to go back to that problem and actually solve it. A hack is just that, a hack, and it won’t hold forever. Even duct tape breaks eventually :)

Perhaps I’m a bit of a stability and uptime snob, having worked at Orbitz, but this made me really uneasy. Matt, who works on pretty decently sized applications, was advocating restarts and hackery. JavaLobby is probably an order of magnitude or two smaller than Orbitz, with very different usage patterns. Therefore, it is probably okay for them to restart JVMs at 3 a.m. sometimes, or even to schedule restarts. But I disagree that restarting as a routine practice in response to some unknown instability is correct. Even if you didn’t write the unstable code, that doesn’t mean it shouldn’t be fixed. In fact, most of the time when something really goes awry, the folks who wrote the code are willing to help you fix it. And most software we use these days is open source. Jump in there and fix it yourself.

Also, I disagree that restarting JVMs is necessary or even safe. Once you have more than one server for an application, a restart can affect the other servers. You have to understand the issues surrounding restarts, because a simple restart might degrade overall system performance. I’ve seen more cascading failures caused by a simple restart than I’d like to remember. The better solution is rarely a restart, but almost always a fix.

Lastly, just to put this in perspective, at Orbitz we had many machines that ran for months on end without issue and without requiring a restart. Most of the time, restarts were necessary only during major system failures. Even in those cases, the restart was always part of an investigation to find and fix the bug that caused the instability, and never something that was done regularly. That said, I don’t fault Matt for this frame of mind. Many applications are built around restarts, and at some companies restarts become a stability best practice. But at a company whose main goal was to ensure that anyone in the world could book their travel 24/7/365, restarts just weren’t on the menu.