JVM restarts

I was reading the latest newsletter from the guys over at JavaLobby, and Matt Schmidt mentioned something about restarts. Here’s his comment:

Sometimes It’s Ok To Restart Your JVM
Now, I’m sure I’ll catch some flak for saying this, but I’m definitely not the first. Sometimes, it is ok to just restart your JVM. Now, I’m not just talking about restarting it because you’ve deployed some new code, no, I’m talking about just restarting it for good measure. Maybe you’re restarting it when it reaches a certain error condition or even a certain amount of memory. Some of us value our sleep at night, and when things start to go awry with software that you didn’t write and you can’t seem to fix it, we start to think about solutions that we don’t normally speak of.

It’s these solutions that many of you will scoff at, but sometimes a simple little monitoring hack can save a lot of headaches. These hacks can re-introduce a modicum of stability in a system that was previously not stable and can return some sanity to your developers who do occasionally need to sleep. So, the moral of the story is that you don’t always need to have the super clean solution; sometimes a hack works just as well. But remember, you have to go back to that problem and actually solve it. A hack is just that, a hack, and it won’t hold forever. Even duct tape breaks eventually 🙂

Perhaps I’m a bit of a stability and uptime snob having worked at Orbitz, but this made me really uneasy. Matt, who works on pretty decently sized applications, was advocating restarts and hackery. JavaLobby is probably an order of magnitude or two smaller than Orbitz with very different usage patterns. Therefore, it is probably okay for them to restart JVMs at 3 am sometimes or even schedule restarts. But I disagree that restarting as a practice due to some unknown instability is correct. Even if you didn’t write the unstable code, that doesn’t mean it shouldn’t be fixed. In fact, most of the time when something really goes awry, the folks that wrote the code are generally willing to help you fix it. And most software we use these days is open source. Jump in there and fix it yourself.

Also, I disagree that restarting JVMs is necessary or even safe. Once you have more than one server for an application, a restart could actually impact other servers. You have to understand the issues surrounding restarts because overall system performance might be impacted by a simple restart. I’ve seen more cascading failures due to a simple restart than I’d like to remember. The better solution is rarely a restart, but almost always a fix.

Lastly, just to put it in perspective, at Orbitz we had many machines that could run without issue for months on end without requiring a restart. Most of the time restarts were necessary only during major system failures. However, even in those cases it was always an investigatory process in order to find and fix the bug that caused the instability and never something that was done regularly. However, I don’t fault Matt for this frame of mind. Many applications are built based on restarts and often restarts become stability best practices at some companies. Having worked at a company whose main goal was to ensure that anyone in the world could book their travel 24/7/365, restarts just weren’t on the menu.

19 thoughts on “JVM restarts”

  1. Hi Brian. Just to clarify, it’s not something that I would recommend all the time 🙂 In this case, it was one particular service who was mis-behaving, something we didn’t write ourselves and which we have been unable to get support on enough to fix the issue. In general, we pride ourselves on keeping our services running as long as possible and generally don’t like to restart things 😉

  2. I completely understand those cases, but I think the idea is, like you said, to first and foremost fix the underlying issue. I wasn’t particularly singling you out for these comments. I’ve worked in many places where restarts were required. I think as a philosophy though, they should be avoided. Likewise, if this service is misbehaving and you don’t control it, wrap it or proxy it within a JVM if possible. That way you can control how it is misbehaving and fix it (to some degree). ClassLoaders work nicely in these cases as do systems like OSGi. We implemented something like this for Orbitz, in which services were all proxied using futures so that we could time out as well as completely ignore their instability if necessary.
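
    The proxy-with-futures idea can be sketched roughly as follows. This is only an illustration of the general technique, not the Orbitz implementation; the `TimeoutProxy` class, its `call` method, and the fallback parameter are hypothetical names chosen for the example.

    ```java
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical helper: runs a call to an unreliable service on a separate
    // thread and gives up after a deadline instead of blocking the caller.
    class TimeoutProxy {
        private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "service-proxy");
            t.setDaemon(true); // stuck calls must not keep the JVM alive
            return t;
        });

        // Returns the service result, or the fallback if the call is too slow or fails.
        <T> T call(Callable<T> service, long timeoutMillis, T fallback) {
            Future<T> future = pool.submit(service);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupts the worker; the service may ignore it
                return fallback;
            } catch (ExecutionException e) {
                return fallback; // the service threw; treat it as unavailable
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
                return fallback;
            }
        }
    }
    ```

    A caller would wrap each remote call, for example `proxy.call(() -> pricing.quote(id), 500, cachedQuote)`, where `pricing` and `cachedQuote` are placeholders. Note that a call that ignores interruption still occupies a worker thread after the timeout, so this contains the instability rather than curing it.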

  3. I decided to restart our ActiveMQ JVMs around 4AM every night and it ended up being a cheap fix for a low priority problem for us.

    I’m with Matt on this one.

  4. Restarting is just a fact of life in an unstable world. In the JVM, if you have a “stuck thread” that’s blocked and waiting for a lock to be resolved, you’ll be required to restart the JVM. This can happen when using Oracle’s JDBC driver and the RDBMS has a locked table. If you can’t get the table unlocked for some reason, you may have to recycle the JVM while you’re figuring out how to keep the database table unlocked. This is the art of restoring service in a critical environment. In a crisis, you do whatever it takes to keep the systems running.

    Application developers should also consider exiting the JVM when certain errors occur. If you get an OutOfMemoryError, you should terminate and start from a good state. There are many cases where the JVM process should die and be restarted rather than trying to continue on in a corrupted state.
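
    One way to express this fail-fast policy in code is an uncaught-exception handler that stops the process when an out-of-memory condition reaches the top of a thread. The sketch below is illustrative only: `FailFastHandler`, the injectable `terminate` action, and the exit code are assumptions made for the example, and an external supervisor is assumed to restart the process.

    ```java
    // Hypothetical fail-fast handler: if any thread dies from an OutOfMemoryError,
    // stop the JVM so an external supervisor can start a fresh one.
    class FailFastHandler implements Thread.UncaughtExceptionHandler {
        private final Runnable terminate;

        FailFastHandler(Runnable terminate) {
            this.terminate = terminate;
        }

        // Walks the cause chain, since the error may arrive wrapped in another exception.
        static boolean isFatal(Throwable t) {
            for (Throwable c = t; c != null; c = c.getCause()) {
                if (c instanceof OutOfMemoryError) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public void uncaughtException(Thread thread, Throwable error) {
            if (isFatal(error)) {
                terminate.run();
            } else {
                System.err.println("Uncaught in " + thread.getName() + ": " + error);
            }
        }

        // Production wiring: halt immediately, skipping shutdown hooks that might
        // themselves need memory. Exit code 3 is an arbitrary choice.
        static void install() {
            Thread.setDefaultUncaughtExceptionHandler(
                new FailFastHandler(() -> Runtime.getRuntime().halt(3)));
        }
    }
    ```

    The handler only sees errors that nothing else catches, so code that swallows `Throwable` will defeat it. Newer HotSpot JVMs also accept the `-XX:+ExitOnOutOfMemoryError` flag, which gives similar behavior without any application code.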

  5. Hi Brian. It got posted on dzone.com and I found it there. Looks like it just made it to the frontpage there, so I’m sure we’ll be seeing more traffic to this post 🙂

  6. I agree with Matt on the specific case.

    I am not in favor of restarting due to instability problems overall, though. If the crashing service is one in which you expect growth, it tends to only defer the problem.

    A weekly restart can turn into a 6 day restart (and then 5… 4… so on) for exactly that reason. And then when you do try to fix it, you find out that you’ve introduced a dozen other similar problems that you haven’t noticed because you’ve been papering over all of them.

    It’s much better to deal with the problems as soon as possible. Getting woken up at 3 in the morning a few times will get them fixed even faster. 🙂

  7. DK,

    Hmmmmmmmm….. I guess my main concern is that system stability isn’t as simple as JVM restarts. I mean, you remember when a simple restart brings an entire system down. Or worse, a simple restart means the box doesn’t ever come back up. I just tend to look at restarts as dangerous and to be avoided. In most cases there are fixes.

    However, now that I own my own company, I can definitely understand the trade-off between the correct fix and the cost-effective fix.

    So, I guess I’m looking at the fence from my air conditioned living room and you guys are on the other side. 😉 At least I know there is a fence. hehe

  8. David O’Meara,

    I have definitely been there with 24/7. But even with restarts you still get paged. I’d rather fix the problems so that at least I’m only getting paged when the system is jacked and not when my restart cron has failed or, worse, the restarted box won’t come back up.

  9. Tim Goeke,

    I disagree that it is a fact of life. I know for a fact it isn’t. You can build systems that don’t need restarts. In fact, there are a number of JSRs in the works to specifically address requiring restarts for software updates.

    OutOfMemory errors are application bugs. Stuck threads and deadlocks are also application bugs. Everything you’ve stated is essentially fixable. I say don’t settle for restarting a JVM because the bug is hard to fix. Plus, I think you are talking about failure scenarios and not the scheduled restarts that Matt was referring to. And on that front you are correct. Restarts (currently) are required to fix a broken system. It’s the only way to get your system up and running again. But, when crisis restarts turn into scheduled restarts, I get uneasy.

  10. In some cases, it is necessary to restart the JVM. I agree that it should not be the usual path taken when a program becomes unstable, but as someone pointed out, it is hard to recover from an OutOfMemoryError. It’s possible, but it takes a lot of care. It should also be noted that an OutOfMemoryError could arise because the user started the JVM with incorrect -Xm? arguments.

    The situations where restarting the JVM is unavoidable, though, are when your program requires a different classpath than it has at startup, or when you want to change the value of the “user.dir” property. Changing either might seem like a rare thing, but it was necessary and caused quite a bit of headache during the development of DrJava, our open-source Java IDE (www.drjava.org).

  11. Mathias,

    I’m not sure I follow you with changing the user.dir or classpath comment. Setting up a new classloader after the application starts up is pretty common. This is the root of J2EE containers. Most applications insulate themselves in this manner so I’m not sure how this impacts restarts. I’ve found in most cases that it actually helps to create classloaders after startup because it allows you to throw them out and recreate them if necessary.

    As for the user.dir thing, I can kinda see this since system properties are JVM scoped. Is that what you are referring to?
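
    The "throw them out and recreate them" approach to classloaders might look something like the sketch below. It is a minimal illustration under assumptions, not how any particular container does it; `ReloadableModule` and its methods are hypothetical names.

    ```java
    import java.io.IOException;
    import java.net.URL;
    import java.net.URLClassLoader;

    // Hypothetical holder for code loaded after startup. Replacing the loader
    // drops the old classes (once nothing references them) without a JVM restart.
    class ReloadableModule {
        private final URL[] classpath;
        private URLClassLoader loader;

        ReloadableModule(URL[] classpath) {
            this.classpath = classpath.clone();
            this.loader = newLoader();
        }

        private URLClassLoader newLoader() {
            // Parent is the loader of this class, so shared API types stay visible.
            return new URLClassLoader(classpath, ReloadableModule.class.getClassLoader());
        }

        synchronized Class<?> load(String className) throws ClassNotFoundException {
            return loader.loadClass(className);
        }

        // Throws away the current loader and starts over with a fresh one.
        synchronized void reload() throws IOException {
            URLClassLoader old = loader;
            loader = newLoader();
            old.close(); // releases open JAR handles (available since Java 7)
        }

        synchronized ClassLoader currentLoader() {
            return loader;
        }
    }
    ```

    Because each module gets its own loader, its effective classpath can differ from the one the JVM started with. System properties such as user.dir are a different matter: `System.setProperty` changes them for the whole JVM, so a classloader offers no isolation there.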

  12. Brian,

    You mentioned earlier that there are a number of JSRs in the works to specifically address requiring restarts for software updates. Do you have any other info on these? Like the JSR numbers? Any links to more info on this would be greatly appreciated.
