You write your test cases, run them in your environment in random order, and see all of them pass. But when you push your code later and have your CI server run the tests again, one of them fails for no apparent reason.
You run it again, and it fails again. Then, suddenly, it passes, even though you haven’t changed the code or the test itself.
That’s a flaky test: one that sometimes passes and sometimes fails for no obvious reason. And yes, flaky tests are just as frustrating as they sound.
There are many problems with flaky tests. They slow down development, hide crucial design problems, and can turn out to be very expensive if not handled quickly. They also significantly hinder the efficiency of CI: 47% of failed jobs succeeded in the second round when manually restarted.
A flaky test is a test of your web app code that returns both passes and failures across runs of the same, unchanged code; it doesn’t produce the same result every time.
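One way to make that definition concrete is to rerun a single test many times against unchanged code and look at the spread of outcomes. The sketch below is only an illustration: `classify` and `make_intermittent_test` are hypothetical names, and the intermittent test is a stand-in that fails on every third call so the behavior is reproducible.

```python
from collections import Counter
from typing import Callable


def classify(test: Callable[[], None], runs: int = 20) -> str:
    """Run the same test repeatedly and label it by the mix of outcomes."""
    outcomes: Counter = Counter()
    for _ in range(runs):
        try:
            test()
            outcomes["pass"] += 1
        except AssertionError:
            outcomes["fail"] += 1
    if outcomes["pass"] and outcomes["fail"]:
        return "flaky"  # both results seen with no code change
    return "stable-pass" if outcomes["pass"] else "stable-fail"


def make_intermittent_test(fail_every: int) -> Callable[[], None]:
    """Hypothetical stand-in for a flaky test: fails on every Nth call."""
    calls = {"count": 0}

    def test() -> None:
        calls["count"] += 1
        assert calls["count"] % fail_every != 0, "intermittent failure"

    return test


def always_passes() -> None:
    assert 1 + 1 == 2


print(classify(make_intermittent_test(3)))
print(classify(always_passes))
```

A test that only ever passes or only ever fails is at least consistent; the one that lands in the mixed bucket is the flaky one.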
Unreliable test results stem from several factors, such as inconsistencies in the build environment, timing and time zone issues, failing to refresh data between test runs, and dependence on the test execution order.
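Time zone issues are a good example of how the environment leaks into a result. In the sketch below, `local_date` is a hypothetical helper, and the instant and offsets are made up for illustration. The same instant falls on different calendar dates depending on the zone, so a test that relies on the machine’s implicit zone can pass on a laptop and fail on a CI server. Passing both the instant and the zone explicitly keeps the result the same everywhere.

```python
from datetime import date, datetime, timedelta, timezone


def local_date(instant: datetime, tz: timezone) -> date:
    """Return the calendar date of `instant` as seen in time zone `tz`."""
    return instant.astimezone(tz).date()


# A fixed instant close to midnight UTC (illustrative value).
INSTANT = datetime(2024, 1, 15, 23, 30, tzinfo=timezone.utc)


def test_date_in_utc() -> None:
    assert local_date(INSTANT, timezone.utc) == date(2024, 1, 15)


def test_date_two_hours_ahead_of_utc() -> None:
    # 23:30 UTC is already 01:30 the next day at UTC+2.
    tz = timezone(timedelta(hours=2))
    assert local_date(INSTANT, tz) == date(2024, 1, 16)


test_date_in_utc()
test_date_two_hours_ahead_of_utc()
```

Neither test reads the wall clock or the host’s zone setting, which is what removes the environment as a variable.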
Regardless of the reason, flaky tests slow down your CI/CD pipeline and reduce your team’s confidence in testing processes. There’s a lot of uncertainty and second-guessing. You’re unsure whether a successful test run means your code is free of bugs or whether you should spend more time trying to reproduce and fix an issue when a test fails.
Typically, you can categorize flaky tests into the following two categories:
In case of an order-dependency issue, the script above will print the minimal reproduction.
Several devs have publicly shared stories about stumbling onto flaky tests. Product developer Ramona Schwering shared her testing nightmare with Smashing Magazine, and it does a brilliant job of highlighting the unpredictability of flaky tests.
For a UI test, Schwering and her team built a custom-styled combo box that let users search for a product and select one or more results. The testing was going fine for a few days, but then suddenly the test for searching and selecting a product in the combo box failed in one of the builds in their CI system.
A flaky test like this blocks the continuous deployment pipeline, slowing down feature delivery. These tests are also expensive to repair, requiring devs to spend several hours or even days debugging. And because the tests are no longer deterministic, you can’t rely on their results either.
The more variables you introduce within a test suite, the greater the likelihood of flaky tests, because more variables mean more risk factors. End-to-end and integration tests at scale also produce a substantial share of flaky results, since they bring more complex contributing variables into the test suite.
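A back-of-the-envelope calculation shows how quickly this compounds. The numbers here are assumptions chosen purely for illustration: suppose each variable independently causes a spurious failure with probability `p`, and a run fails if any one of them does. The chance of at least one spurious failure across `n` such variables is then `1 - (1 - p) ** n`.

```python
def spurious_failure_rate(p: float, n: int) -> float:
    """Chance that at least one of n independent variables misbehaves.

    Assumes every variable fails independently with the same probability p,
    which is a simplification made for illustration only.
    """
    return 1 - (1 - p) ** n


# Illustrative values: a 1% per-variable failure chance.
for n in (1, 10, 50, 100):
    print(n, round(spurious_failure_rate(0.01, n), 3))
```

Under these assumptions, a failure chance that looks negligible for a single variable grows into a run that fails more often than it passes once around a hundred variables are in play.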
But why are flaky tests so undesirable?
Aside from the frustration they cause, flaky tests drain your resources. They require developers to analyze and retest their code, which leads to a significant waste of time and costly interruptions.
Tribal knowledge is another vulnerability within test suites, especially when no one owns the history of past test results. Many organizations today don’t have an updated, accessible database of this information, so the knowledge ends up siloed within a team or with individuals.
Flaky tests also breed mistrust of the code and of test outcomes in general, until the engineering team stops trusting test results altogether. The whole point of testing is to get reliable results, and if you can’t trust them, why waste time creating them?
More importantly, if tests keep failing when nothing is actually wrong, devs lose trust in the test suite. When one of those tests eventually catches a real bug, they will assume there isn’t one, simply because they are used to seeing it fail even when there is no issue in the app at all.
It’s important to take any error in the build seriously. Simply assuming a flaky test isn’t a real bug and therefore doesn’t need to be taken care of or even debugged is wrong.
Here’s a list of measures to effectively deal with flaky tests:
At Buildkite, we recommend fixing or replacing flaky tests to avoid costly interruptions to the main branch builds. Doing so also improves your test suite’s reliability and helps you identify (and fix) your biggest problem areas, which is another significant advantage.
Replacing the flaky test is another alternative: you delete the test and have it rewritten from scratch, preferably by a developer who hasn’t seen the flaky version.
If you cannot develop stable tests for some part of the code, it means either something is wrong with the test and/or the testing approach or something is wrong with the code being tested. So, if you’re certain the tests are fine, it’ll serve you well to take a deep look at the code.
Buildkite’s newly released Test Analytics tool makes it easier to identify, track, and fix or replace problematic flaky tests. It integrates with your test runner and can work with any CI/CD platform to give you in-depth information about your tests in real time.