The first time the warehouse-reconciliation job failed in late 2014, it failed at 03:11 on a Sunday morning and woke a contractor named Pieter, who solved it the way most of us would: he wrapped the offending block in a try/except, logged the exception to a file no dashboard read, and went back to bed. The ticket closed on Monday with the words "intermittent — handled gracefully." Pieter is no longer with the company. The exception is still being silently logged. In April 2023 it surfaced again, this time as a corrupted manifest that cost a customer in Bremerhaven a €74,000 detention charge on a container ship.
This issue is a catalogue of eight such episodes, drawn from a codebase we have been responsible for since 2014. The criteria for inclusion were narrow on purpose: the fix had to be applied, the ticket had to close, and the original bug, or a sibling that traced to the same misunderstanding, had to return at least once after a quiet stretch of six months or longer. Six of our eight entries went dormant for more than two years. One, the time-zone case in §02-G, slept for eleven.
We are not trying to be clever about hindsight. Each fix is shown beside the data we had at the time, and at every step we ask what we should have noticed. The answer, with embarrassing regularity, is that the signal was there — in a graph nobody scrolled to, in a comment a junior engineer left and a senior engineer waved away, in a flaky test that was muted rather than understood. We list those signals plainly. They are the most useful part of this issue.
At the end, in §06, we have written down the seven heuristics we now use to smell out the kind of bug whose first fix will be the wrong one. They are not laws. They have failure modes. But they have, in the last eighteen months, caught three bugs whose ordinary fix would have looked perfectly reasonable. We are calling that progress.