A flaky test is one that both passes and fails on the same code, with nothing changed, and most teams treat it as a minor irritation to be rerun away. It is not minor.
It is also getting worse. Bitrise analysed more than 10 million CI builds over three and a half years and found the share of teams hitting flaky tests rose from 10 per cent in 2022 to 26 per cent in 2025, a 160 per cent increase, as pipelines grew more complex (SD Times, 2025). At Google's scale, research has found roughly one in seven test runs hits a flaky failure (Katalon, 2026).
The real damage is not the wasted minute on a rerun. It is what flakiness does to trust. Once a team learns that a red build is probably just noise, they stop reading failures, and the one failure that was real ships to production. This post is about why flaky tests are so expensive, and what actually fixes them, which is rarely another rerun.
The cost you see is reruns. The cost you do not is trust.
The visible cost is real. Flaky reruns consume 15 to 30 per cent of total CI time, and engineers spend 5 to 10 hours a week chasing failures that turn out to be noise (Katalon, 2026). But that is the smaller half. Wait time dominates compute by a wide margin, and flakiness is paid out of salaries and customer trust, not out of an invoice anyone reviews (CI/CD Watch, 2026).
The hidden cost is behavioural. Microsoft's analysis found that developers who encounter flaky tests become significantly less likely to investigate the next failure they see (Autonoma, 2026). Each unexplained red build that turns out to be nothing trains the team to assume the next one is nothing too.
The nuance is that some reruns are legitimate. A genuinely environmental blip does happen, and rerunning once in that case is reasonable. The problem is not the occasional rerun; it is rerunning as a reflex, because that reflex is the exact moment the signal stops meaning anything.
A test suite you do not trust is worse than no suite at all
This is the counterintuitive heart of it. A suite nobody trusts still costs you the time to run and maintain it, and on top of that it hands you false confidence, which is more dangerous than knowing you have no safety net. When CI is unreliable, teams quietly stop writing tests for new code on the grounds that they will just be flaky anyway and stop requiring green builds before merging, and the whole investment becomes a sunk cost (FlakyGuard, 2026).
The scale of noise a suite can carry before that happens is striking. Slack's engineering team found flaky tests made up nearly 57 per cent of its CI failures before a dedicated effort brought that down to under 4 per cent (Autonoma, 2026). When more than half of your red builds are false alarms, no amount of asking people to be diligent will keep them reading the results.
The nuance is that the answer is not to delete your tests. A trustworthy suite is enormously valuable, which is precisely why its credibility is worth defending. Mozilla found that after fixing its flaky tests, developer confidence in the suite rose 29 per cent, with faster fixes and fewer escaped bugs (StickyMinds, 2025). Trust, once restored, pays for itself.
Flaky tests are not random, even though they feel random
The failures look random, but every flaky test has a deterministic root cause. The test is racing against timing, depending on shared state it should not depend on, or assuming an execution order that is not guaranteed. Large-scale research finds a remarkably consistent distribution: asynchronous wait and timing issues cause roughly 45 per cent of flaky tests, concurrency and resource contention about 20 per cent, and test order dependencies about 12 per cent, with the rest split between environment differences and non-deterministic logic (Autonoma, 2026).
Knowing the category is most of the battle, because each one has a known fix: a real wait for a condition instead of a fixed sleep, proper isolation instead of shared state, a stable way of finding elements instead of a brittle one. The hard part is rarely the fix itself.
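As an illustration of the first of those fixes, here is a small sketch of waiting on a condition rather than sleeping for a fixed time. The helper name wait_until and its timeout and interval defaults are assumptions chosen for the example; many test frameworks ship their own equivalent, which is usually the better choice when one exists.

```python
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns True if the condition was met, False if the deadline expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def demo():
    """Hypothetical background job that finishes after about 0.2 seconds."""
    done = threading.Event()
    timer = threading.Timer(0.2, done.set)
    timer.start()
    try:
        # A fixed time.sleep(0.1) here would fail, and time.sleep(5) would
        # waste time; polling returns as soon as the job is actually done.
        return wait_until(done.is_set, timeout=2.0)
    finally:
        timer.cancel()


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the test's duration tracks the real event, with the timeout acting only as an upper bound, so a slow CI runner makes the test slower rather than red.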
The nuance is that the hard part is visibility. Most teams cannot say which of their tests are flaky or why, so the problem feels unsolvable and stays unsolved. With even a basic view of which tests fail inconsistently and how often, it turns from a vague malaise into a ranked list of engineering tasks.
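A basic view of that kind can be sketched in a few lines of Python. The record shape (test name, commit, passed) and the function name rank_flaky_tests are assumptions for illustration. The definition of flaky used here, a test with both a pass and a fail on the same commit, follows the definition at the top of this post; real tooling may use a different rule.

```python
from collections import defaultdict


def rank_flaky_tests(runs):
    """Rank tests by how often they fail inconsistently.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    A failure counts as noise only when the same test also passed on
    the same commit. Tests that fail consistently are left out, since
    those look like real failures rather than flakes.

    Returns (test_name, flake_rate, noisy_failures) tuples, worst first,
    where flake_rate is noisy failures divided by total runs of the test.
    """
    outcomes = defaultdict(lambda: defaultdict(list))
    for test_name, commit_sha, passed in runs:
        outcomes[test_name][commit_sha].append(passed)

    ranked = []
    for test_name, by_commit in outcomes.items():
        total_runs = sum(len(results) for results in by_commit.values())
        noisy_failures = sum(
            results.count(False)
            for results in by_commit.values()
            if True in results and False in results
        )
        if noisy_failures:
            ranked.append((test_name, noisy_failures / total_runs, noisy_failures))

    # Highest flake rate first; ties broken by name for a stable order.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked
```

Fed with the pass and fail records a CI system already keeps, the output is the ranked list described above: the test at the top is the one costing the most trust.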
Stop reaching for the rerun button
The instinctive response, rerun until green, is the one that makes everything worse. A retry hides the flake, hides any real regression sitting behind it, burns compute, and teaches the team to treat failure as normal. Retried workflows consume 15 to 30 per cent more compute (FlakyGuard, 2026), and every reflexive rerun is a small lesson that red does not mean broken.
Even Google, which does use targeted retries, has long argued that simply marking a test as flaky addresses the problem from the wrong direction and throws away information worth keeping. Their guidance is to have the test capture what it did, retry intelligently only for a known external cause, and if the failure reproduces, let it fail (Google, 2016).
The nuance is that retries are not banned. A bounded, deliberate retry while you investigate is fine, and for a genuinely external dependency it can be reasonable. The rule is simply that a retry buys you time to fix the root cause. It is not itself the fix, and treating it as one is how flakiness becomes permanent.
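One way to keep a retry bounded and deliberate is sketched below. The exception class ExternalDependencyError and the function run_with_bounded_retry are hypothetical names, and the rule of retrying only on that one exception type is an assumption about how a team might mark a known external cause. Any other failure, including an assertion failure, is not retried at all.

```python
import time


class ExternalDependencyError(Exception):
    """Hypothetical marker for a failure known to originate outside the code under test."""


def run_with_bounded_retry(test, max_attempts=2, delay=0.0, log=None):
    """Run `test`, retrying only on ExternalDependencyError.

    At most `max_attempts` attempts are made in total. Each attempt is
    appended to `log` so the flake stays visible even when a retry passes.
    Any other exception propagates immediately and is never retried.
    """
    if log is None:
        log = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = test()
            log.append((attempt, "passed"))
            return result
        except ExternalDependencyError as exc:
            log.append((attempt, f"external: {exc}"))
            if attempt == max_attempts:
                raise  # the failure reproduced, so let it fail
            time.sleep(delay)
```

The log matters as much as the retry: a pass on the second attempt still leaves a record that the first attempt failed and why, which is the information a blind rerun throws away.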
Measure it, quarantine it, then actually fix it
The workable sequence is unglamorous. Make flakiness visible with a flake score, quarantine the worst offenders into a separate non-blocking suite so they stop poisoning the signal, and then fix root causes on a deadline rather than someday. The first step alone pays off: teams that simply adopted monitoring tools saw 25 per cent fewer flaky reruns, not from fixing anything but purely from being able to see the problem (TestDino, 2026).
The fix step needs teeth. Microsoft's policy of fixing or removing a flaky test within two weeks cut its overall flakiness by 18 per cent in six months (TestDino, 2026). A deadline is what stops a quarantine list from quietly becoming a permanent parking lot.
The nuance, and it is an important one, is that quarantine is a mitigation, not a cure, and it carries a real risk. A quarantined test no longer protects the product, so a quarantine list left to grow becomes a slowly widening hole in your coverage (Functionize, 2026; minware, 2025). Every quarantined test needs an owner and an expiry, or you have not solved the problem, only hidden it more tidily.
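The owner-and-expiry rule is easy to enforce mechanically. The sketch below is one possible shape, not a description of any particular tool: QuarantineEntry, expired_entries, and the 14-day default (chosen to mirror the two-week policy mentioned above) are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class QuarantineEntry:
    """Hypothetical record for one quarantined test."""

    test_name: str
    owner: str
    quarantined_on: date
    max_days: int = 14  # assumed default, mirroring a two-week fix-or-remove policy

    def __post_init__(self):
        # Refuse entries that nobody is responsible for.
        if not self.owner.strip():
            raise ValueError(f"{self.test_name}: a quarantined test needs an owner")

    @property
    def expires_on(self):
        return self.quarantined_on + timedelta(days=self.max_days)


def expired_entries(entries, today):
    """Return the entries whose quarantine period has run out as of `today`."""
    return [entry for entry in entries if today > entry.expires_on]
```

A scheduled CI job could call expired_entries with date.today() and fail, or open a ticket for the owner, whenever the list is non-empty, so that quarantine cannot silently become permanent.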
The part worth sitting with
So the question is not whether you have flaky tests, because at any real scale you do, and the share of teams fighting them has more than doubled in three years. The question is what your team now does when a build goes red. If the honest answer is that they shrug, rerun, and move on, then your test suite has already stopped doing its job, and it is only a matter of time before a real failure rides through on the same shrug. The fix is not heroic. It is making flakiness visible, quarantining the worst of it with a deadline, and fixing the timing and state bugs underneath rather than retrying past them. The goal was never zero flaky tests. It was a red build that means something and a green build that means more. Until you can trust the signal, you are not saving time by skipping the investigation. You are just paying for the bug later, at full price.
Author note
I am Manjunaathaa, an Associate DevOps Engineer at Frigga Cloud Labs. I work across AWS, GCP, and Azure daily, with GitHub Actions as my deployment backbone. My focus is Proactive Resilience: treating the test suite as a feedback loop that has to stay trustworthy, because the moment it lies to you, every safeguard downstream is running blind. Every practice in this post is something I actually run in production, not something I read about. I wrote this because flakiness is the cheapest problem to ignore and the most expensive to leave alone. The thing I keep seeing is teams measuring everything about their pipeline except whether they still believe its results, and then wondering how a bug the suite had caught still shipped. I track a flake rate the same way I track uptime, quarantine with an expiry date, and fix the timing bug rather than retry past it, because a green build I cannot trust is worth less than no build at all. Let's connect on LinkedIn → Manjunaathaa.
