Waking Sleeping Tests

10 Jul 2016 java testing linux

The Problem

When running through your test suite on your own machine, everything passes and the code looks like it’s in a great place. As soon as it’s run by CI, you see the flaky integration tests fail day after day. What gives? I tend to see two situations (sometimes of my own making) that result in these failures: tests that rely on short timeouts, and tests that rely on Thread.sleep().

The reason these problems crop up on CI is that the CI machine’s CPU tends to be fully utilized (or over-utilized). Under that load, the timeouts expire and the Thread.sleep() calls return before the work under test has had enough time to finish.

Let’s look at scenarios with Thread.sleep(). These tests need to be refactored, but how does one go about doing that? The first step is always reproducing the failure.
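For concreteness, here is a contrived sketch of the kind of test I mean. The worker class is made up purely for illustration; only the shape of the test matters:

import static org.junit.Assert.assertTrue;

import java.util.concurrent.ConcurrentLinkedQueue;

import org.junit.Test;

public class SleepyQueueTest {

    // A stand-in for real production code: a thread that drains
    // a queue in the background. Invented for this example.
    static class Worker {
        final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        void start() {
            new Thread(() -> {
                while (queue.poll() != null) {
                    // process the message
                }
            }).start();
        }
    }

    @Test
    public void drainsTheQueue() throws InterruptedException {
        Worker worker = new Worker();
        worker.queue.add("message");
        worker.start();

        // The flaky part: this assumes 100 ms is always enough time
        // for the background thread to run. On a loaded CI machine,
        // it often isn't.
        Thread.sleep(100);

        assertTrue(worker.queue.isEmpty());
    }
}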

Refactor Part 1 - Reproducing the Failure

I very recently figured out a simple way to reproduce these failures with the help of a few Linux commands (which also work on OS X). To start, I would suggest running in a VM allocated only one CPU. Vagrant is a good tool to help with that.
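With Vagrant, capping the VM at one CPU is a single provider setting. A minimal sketch of the relevant Vagrantfile bit (the box name is just an example):

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provider "virtualbox" do |vb|
    # one CPU makes contention easy to create
    vb.cpus = 1
  end
end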

tool - dd

First, we need a simple way to start eating up CPU cycles. The dd command can be used in a clever way to saturate one CPU completely.

dd if=/dev/zero of=/dev/null

This tells dd to read from /dev/zero, which just constantly produces zero-bytes, and write them to /dev/null. This essentially does nothing, but can take up 100% of a CPU if nothing else is running.

tool - nice

Now with dd running, you can run your flaky test with a high nice value. nice launches a process (and any of its child processes) with a given scheduling priority. Zero is the highest priority available to non-root processes, and is the default.

To run your process with a lower priority, use a higher number up to 19.

nice -n 1 script-to-run-just-my-test.sh

Put it together

With these tools in hand, we can go ahead and run the test, provoking a failure.

$> dd if=/dev/zero of=/dev/null &
$> nice -n 5 ./run-my-test.sh
...see the failure...
# stop the instance of dd
$> kill %1

Additionally, running multiple instances of dd can help here. There is some trial and error in finding the right nice value and/or number of dd instances to run. Watching the test’s process in top or htop to see how much CPU it gets is helpful.
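For example, saturating two cores and then running the test at a lower priority might look like:

$> for i in 1 2; do dd if=/dev/zero of=/dev/null & done
$> nice -n 10 ./run-my-test.sh
# stop both instances of dd
$> kill %1 %2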

Now that you can reproduce the error, you can determine when a fix has most likely fixed the test. I say “most likely” because what you’ve really done is reduce the occurrence of the test failure to some acceptable rate, possibly even less than 0.1%.
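One way to put a number on that is to run the test in a loop under the same load and count failures. A quick sketch, assuming the script exits non-zero when the test fails:

$> failures=0
$> for i in $(seq 1 1000); do ./run-my-test.sh >/dev/null 2>&1 || failures=$((failures + 1)); done
$> echo "$failures failures out of 1000 runs"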

Refactor Part 2 - Removing the Flakiness

Refactoring flaky tests that rely on Thread.sleep() could fill an entire series of blog posts. So, I’ll give some high-level advice here that might get fleshed out in subsequent posts.
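To give a taste of the general direction: most of these refactors replace the fixed sleep with a wait on the actual condition, bounded by a generous timeout. Here is a rough sketch of such a helper (my own illustration; libraries like Awaitility package up the same idea):

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public final class WaitFor {
    // Poll until the condition holds, failing only after a generous
    // deadline. Unlike a fixed sleep, this costs the full timeout
    // only when the test is actually about to fail.
    public static void condition(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            TimeUnit.MILLISECONDS.sleep(50); // short poll interval
        }
    }
}

The test from earlier would call WaitFor.condition(() -> worker.queue.isEmpty(), 10_000) instead of sleeping. The ten-second budget is only fully spent when the test is genuinely failing, so the happy path stays fast.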