A beautiful race condition observed in the wild

Working on legacy code can show you things you’d never otherwise see. A few years ago I read the blog post “A Beautiful Race Condition”. It gives a very detailed explanation of what can happen under the hood of a java.util.HashMap, which is not thread safe, when someone uses it as if it were. At the time this was an interesting but hypothetical topic for me, because the problem is well known and I’d been working on web services where we avoided sharing objects between threads.

Fast forward a few years and a couple of positions, and now I’m working on making a good-sized monolith able to scale. One day the operations team contacted me to say that one of our production servers was setting off alarms because its CPU usage was hitting 100%. When I got a thread dump from the problem server, what did I see but

  java.util.HashMap.getEntry(HashMap.java:347)
  java.util.HashMap.containsKey(HashMap.java:335)
  name.changed.to.protect.the.guilty.Example.getSomething(Example.java:1234)

(Unfortunately I can’t show the real code because it’s proprietary.)

There were multiple threads stuck in this state. Seeing the stack trace triggered just enough recollection that I was able to google up that blog post. Rereading it confirmed that the description matched what I was seeing, and the stack trace showed me which code needed the fix.

For our servers it seems that one stuck thread will cause one CPU to report full usage. Two stuck threads on a 4 CPU VM caused 50% CPU usage for the entire server, and four stuck threads on a 4 CPU VM caused 100% CPU usage for that server. The system was otherwise responsive, so these threads weren’t preventing other threads from getting CPU cycles. That was fortunate, because the only way to get the CPU usage back down is to restart the JVM, and the only way to prevent the problem from recurring is to fix the code and deploy a new version.

It’s straightforward to make sure you’re not passing HashMap objects from one thread to another. A case to watch out for is using a HashMap as the value in a cache: if two different threads both get cache hits on the same key, the same HashMap value object will end up shared by those threads. Since the contents of a cache are intended to be shared, any map stored as a value in a cache should be a ConcurrentHashMap or one of the available immutable Map implementations.
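
Since I can’t show the real code, here is a minimal sketch of the two safe options, assuming Java 8 or later. The class name, the two caches, and loadSettings are all hypothetical stand-ins made up for illustration; a real cache would more likely come from a caching library, but the rule about what goes in as the value is the same.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example class, not the real (proprietary) code.
final class SafeCacheValues {

    // Cache whose values are only ever read after they are stored.
    private static final Map<String, Map<String, String>> READ_ONLY_CACHE =
            new ConcurrentHashMap<>();

    // Cache whose values may be updated by whichever thread gets a hit.
    private static final Map<String, Map<String, String>> MUTABLE_CACHE =
            new ConcurrentHashMap<>();

    // Hypothetical loader standing in for a database or service call.
    // Using a plain HashMap here is fine: it is still local to one thread.
    static Map<String, String> loadSettings(String key) {
        Map<String, String> loaded = new HashMap<>();
        loaded.put("owner", key);
        loaded.put("mode", "read-only");
        return loaded;
    }

    // Option 1: store an unmodifiable copy. No thread can write to it,
    // so concurrent reads can never see a HashMap in mid-resize.
    static Map<String, String> getReadOnly(String key) {
        return READ_ONLY_CACHE.computeIfAbsent(key,
                k -> Collections.unmodifiableMap(new HashMap<>(loadSettings(k))));
    }

    // Option 2: store a ConcurrentHashMap, which tolerates concurrent
    // reads and writes from every thread that gets a cache hit.
    static Map<String, String> getMutable(String key) {
        return MUTABLE_CACHE.computeIfAbsent(key,
                k -> new ConcurrentHashMap<>(loadSettings(k)));
    }
}
```

The unmodifiable copy is the simpler choice when nothing needs to change the value after it is cached, because any accidental write fails immediately with an UnsupportedOperationException instead of quietly corrupting the map. The ConcurrentHashMap option is for values that really do have to be updated in place.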