Fixes that fixed the wrong thing — The Patchwork Review, Issue 14

§01 · Editor's note

The bandage
and the wound.

When a fix lands and the on-call pager goes quiet, we tend to treat that silence as proof. It isn't. It is, more often than we like, a coincidence inside a one-week window.

The first time the warehouse-reconciliation job failed in late 2014, it failed at 03:11 on a Sunday morning and woke a contractor named Pieter, who solved it the way most of us would: he wrapped the offending block in a try/except, logged the exception to a file no dashboard read, and went back to bed. The ticket closed on Monday with the words "intermittent — handled gracefully." Pieter is no longer with the company. The exception is still being silently logged. In April 2023 it surfaced again, this time as a corrupted manifest that cost a customer in Bremerhaven a €74,000 detention charge on a container ship.

This issue is a catalogue of eight such episodes, drawn from a codebase we have been responsible for since 2014. The criteria for inclusion were narrow on purpose: the fix had to be applied, the ticket had to close, and the original bug, or a sibling that traced to the same misunderstanding, had to return at least once after a quiet stretch of six months or longer. All eight entries went dormant for more than three years, and five of them for more than five. One, the time-zone case in §02-G, slept for eleven.

We are not trying to be clever about hindsight. Each fix is shown beside the data we had at the time, and at every step we ask what we should have noticed. The answer, with embarrassing regularity, is that the signal was there — in a graph nobody scrolled to, in a comment a junior engineer left and a senior engineer waved away, in a flaky test that was muted rather than understood. We list those signals plainly. They are the most useful part of this issue.

At the end, in §06, we have written down the seven heuristics we now use to smell out the kind of bug whose first fix will be the wrong one. They are not laws. They have failure modes. But they have, in the last eighteen months, caught three bugs whose ordinary fix would have looked perfectly reasonable. We are calling that progress.

§02 · The catalogue

Eight fixes that
fixed the wrong thing.

Each entry: the symptom we saw, what we changed, why it appeared to work, what we should have seen, and what the relapse cost when it finally came due.

Case A · filed 2014-11-09 · returned 2023-04-02

The reconciliation job that "handled" its own corruption.

Original symptom: Nightly warehouse-reconciliation crashed at line 412 of recon.py with a KeyError on the "sku_alt" column when an inbound EDI file arrived without it. About once every nine days.
What we changed: Wrapped the offending block in try/except KeyError, logged "sku_alt missing, skipping row" to /var/log/halberd/recon.log, and continued processing.
Why it looked fixed: The job stopped crashing. On-call paging dropped from one wake-up every nine days to zero in the first six weeks. Pieter was praised in a Friday standup and bought a round.
What we should have seen: Three things. (1) sku_alt wasn't optional in the EDI spec — the senders were dropping it when their own export crashed. (2) Skipping a row silently corrupted the inventory count by between 4 and 70 units per missing row. (3) Nobody had ever read recon.log; it wasn't shipped to the central aggregator.
What the relapse cost: €74,000 detention, four weeks of audit work across two teams, and one customer churn worth roughly €310k ARR. Total: eight and a half years of silent drift surfaced in a single container.
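
A minimal sketch of the alternative to "skip and log", assuming rows arrive as dictionaries. The column set, the ReconResult type and the max_quarantined threshold are hypothetical, not taken from recon.py. The idea is only that a row missing a required column is kept as evidence and the job fails loudly once the count passes a limit someone chose on purpose.

```python
from dataclasses import dataclass, field

REQUIRED_COLUMNS = ("sku", "sku_alt", "qty")  # hypothetical column set


class MissingColumnError(Exception):
    """Raised when more rows than allowed lack a required column."""


@dataclass
class ReconResult:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def reconcile(rows, max_quarantined=0):
    """Split rows into accepted and quarantined; fail loudly past a threshold."""
    result = ReconResult()
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            # Keep the evidence instead of dropping the row.
            result.quarantined.append({"row": row, "missing": missing})
        else:
            result.accepted.append(row)
    if len(result.quarantined) > max_quarantined:
        raise MissingColumnError(
            f"{len(result.quarantined)} row(s) missing required columns"
        )
    return result
```

With max_quarantined left at 0 the job still crashes on the first bad file, which is the original behaviour; raising the limit is then a visible decision rather than a side effect of a catch block.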

Case B · filed 2016-02-22 · returned 2019-08-17

The retry loop that papered over a deadlock.

Original symptom: Order-placement endpoint returned HTTP 500 on roughly 0.3% of requests during the Tuesday-morning peak. Database logs showed deadlock detected on the orders table.
What we changed: Added a tenacity retry with three attempts and 250 ms exponential backoff around the db.session.commit() call. Closed the ticket as "retry on transient deadlock."
Why it looked fixed: The 500 rate dropped from 0.30% to 0.04%. Most retries succeeded on the second attempt. The dashboards were green for thirty-one months.
What we should have seen: The deadlock was caused by two SELECT FOR UPDATE statements taking row locks in opposite orders, depending on whether the customer had a coupon. The fix turned a deterministically reproducible bug into a 0.04% noise floor, and the retry doubled the lock pressure whenever it fired, causing a small but real tail-latency hump.
What the relapse cost: In August 2019 a promo campaign moved 38% of traffic into the coupon path. The deadlock became thermodynamic. 94 minutes of full checkout outage across two regions, an estimated €188k in lost GMV, and an emergency lock-ordering rewrite over a weekend.
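
The lock-ordering idea can be sketched without a database. The example below uses in-process threading locks as stand-ins for row locks; the key names and helper functions are hypothetical. Both workers ask for the same two locks in opposite orders, but because acquisition always happens in sorted key order, neither can hold one lock while waiting for the other.

```python
import threading

# Hypothetical stand-ins for the rows a checkout locks with SELECT FOR UPDATE.
LOCKS = {"coupon:42": threading.Lock(), "order:1001": threading.Lock()}


def with_rows_locked(keys, action):
    """Acquire locks in one global (sorted) order, run action, then release."""
    acquired = []
    try:
        for key in sorted(set(keys)):
            LOCKS[key].acquire()
            acquired.append(key)
        return action()
    finally:
        for key in reversed(acquired):
            LOCKS[key].release()


def run_demo(rounds=200):
    """Two threads ask for the same rows in opposite orders; both finish."""
    finished = []

    def worker(keys):
        for _ in range(rounds):
            with_rows_locked(keys, lambda: None)
        finished.append(tuple(keys))

    threads = [
        threading.Thread(target=worker, args=(["order:1001", "coupon:42"],), daemon=True),
        threading.Thread(target=worker, args=(["coupon:42", "order:1001"],), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return len(finished)
```

In a real schema the equivalent is to issue the locking statements in a fixed order regardless of which branch of the business logic is running; the sort here is just the simplest way to get a fixed order.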

Case C · filed 2017-06-30 · returned 2022-01-14

The cache TTL that bought time and lost truth.

Original symptom: Pricing service was hitting the upstream catalog API at 1,200 req/s during catalog refresh and exceeding its quota of 800 req/s. Catalog operator threatened to throttle us.
What we changed: Added a Redis cache in front of the catalog with a 6-hour TTL. Reduced upstream traffic to ~70 req/s. Beautiful.
Why it looked fixed: Quota complaints stopped on the day of deploy. Latency p95 fell from 240 ms to 38 ms. The cache hit ratio settled at 94.2%, which everyone agreed was an excellent number.
What we should have seen: The catalog operator had been publishing 24 price corrections a day, mostly within the first 90 minutes of a price change. By caching for 6 hours we guaranteed that a correction could take up to 6 hours to reach checkout. The real bug was that we had no invalidation channel, not that we were over quota.
What the relapse cost: January 2022, a Polish supplier corrected a typo from €1.89 to €18.90 on a popular item. For 5 hours 41 minutes we sold 4,308 units at the wrong price. €73,400 in honored refunds, two days of finance rework, and a six-month project to build a proper invalidation bus.
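
A sketch of the difference an invalidation hook makes, assuming a plain in-memory cache rather than Redis. The PriceCache class, its fetch callback and the injectable clock are all hypothetical. The TTL still protects the upstream quota; invalidate() is what lets a correction through before the TTL expires.

```python
import time


class PriceCache:
    """TTL cache that also accepts explicit invalidation messages."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable: sku -> price (the upstream)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # sku -> (price, expires_at)

    def get(self, sku):
        entry = self._entries.get(sku)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        price = self._fetch(sku)
        self._entries[sku] = (price, self._clock() + self._ttl)
        return price

    def invalidate(self, sku):
        """Called when the upstream announces a correction for this SKU."""
        self._entries.pop(sku, None)


upstream = {"X": 1.89}
cache = PriceCache(lambda sku: upstream[sku], ttl_seconds=6 * 3600)
cache.get("X")            # caches 1.89
upstream["X"] = 18.90     # supplier corrects the typo
stale = cache.get("X")    # still 1.89: the TTL alone cannot know
cache.invalidate("X")
fresh = cache.get("X")    # 18.90 after the invalidation message
```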

Case D · filed 2015-09-04 · returned 2020-11-23

The flaky test, muted with a smile.

Original symptom: test_concurrent_dispatch_assignment failed on 6% of CI runs with an off-by-one in the assigned driver list. The failure would not reproduce locally.
What we changed: Marked it @pytest.mark.flaky(reruns=3). CI went green. The test passed 99.8% of the time after that.
Why it looked fixed: The pipeline turnaround dropped by 8 minutes. The eng-team Slack channel got noticeably quieter. We even celebrated.
What we should have seen: The 6% wasn't flakiness — it was a real race between the dispatch worker's claim() and the assignment service's update(), and it depended on which one acquired the Redis lease first. In production it manifested at roughly the same 6% rate but only on routes with more than 14 stops, which were rare in 2015 and common by 2020.
What the relapse cost: An entire week of Black-Friday dispatch had off-by-one driver lists on multi-stop routes. 2,140 packages delivered to a wrong-but-adjacent house, 18 angry customer-service days, and one regional partner contract not renewed (worth ~€420k).
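
Why reruns make a real failure disappear can be shown with one line of arithmetic. The sketch below assumes each attempt fails independently with the same probability, which is an idealization; the 99.8% pass rate quoted above is the observed figure and will not match it exactly. The function name is hypothetical.

```python
def masked_failure_rate(p_fail, reruns):
    """Chance that a test is still red after the given number of reruns.

    Assumes every attempt fails independently with probability p_fail.
    """
    attempts = 1 + reruns
    return p_fail ** attempts


rate = masked_failure_rate(0.06, 3)
print(f"{rate:.4%} of CI runs stay red")  # about 0.0013% under this assumption
```

Under that assumption a 6% race survives in the code at full strength while the pipeline reports it at roughly one red run in 77,000. The rerun changes what CI says, not what production does.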

Case E · filed 2018-03-12 · returned 2024-09-30

The nullable column that became a tombstone.

Original symptom: A new customer_segment column was added to the users table for personalization. Existing rows had no segment. The personalization service kept crashing on None.
What we changed: Coerced None to the string "default" at the boundary of the personalization service. Tests passed. Ticket closed.
Why it looked fixed: Crash rate fell to zero. Personalization features could be safely rolled out without backfilling 11 million existing rows.
What we should have seen: "default" was supposed to be a temporary stand-in until backfill, but the backfill never happened, and within eighteen months three other services were also writing "default" into the column, intentionally, for users they couldn't classify. The semantics of NULL had silently fused with the semantics of "unknown" and of "we gave up."
What the relapse cost: A GDPR-driven audit in late 2024 required us to demonstrate which users had been actively segmented vs. which were unclassified. We could not. €41k in legal fees, eleven weeks of two engineers reconstructing intent from access logs, and a public-facing remediation note.
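
One way to keep the three meanings apart is to make the state explicit at the boundary instead of coercing to a string. This is a sketch under assumptions: the SegmentStatus names, the Segment type and the "unclassifiable" sentinel are hypothetical, not the schema that was eventually adopted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SegmentStatus(Enum):
    NOT_BACKFILLED = "not_backfilled"   # row predates the column
    UNCLASSIFIABLE = "unclassifiable"   # a service tried and gave up
    ASSIGNED = "assigned"               # a real segment was chosen


@dataclass(frozen=True)
class Segment:
    status: SegmentStatus
    value: Optional[str] = None

    def __post_init__(self):
        has_value = self.value is not None
        if has_value != (self.status is SegmentStatus.ASSIGNED):
            raise ValueError("value must be set only when status is ASSIGNED")


def from_column(raw: Optional[str]) -> Segment:
    """Map the stored column to an explicit state; NULL stays distinguishable."""
    if raw is None:
        return Segment(SegmentStatus.NOT_BACKFILLED)
    if raw == "unclassifiable":  # hypothetical sentinel written by services
        return Segment(SegmentStatus.UNCLASSIFIABLE)
    return Segment(SegmentStatus.ASSIGNED, raw)
```

The personalization service can still choose a fallback behaviour for the first two states, but the choice is made in code that names the state, and an audit can count each one.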

Case F · filed 2013-07-01 · returned 2021-05-08

The "duplicate prevention" that buried a bigger duplicate.

Original symptom: Customers were occasionally receiving two confirmation emails per order. Roughly 1 in 4,000 orders.
What we changed: Added a unique constraint on (order_id, email_template_id) in the sent_emails table. Duplicate-send rate went to zero.
Why it looked fixed: No more duplicate emails. Support tickets about "I got two confirmations" stopped within a week. The constraint was a fine piece of database hygiene.
What we should have seen: The duplicates were happening because two separate code paths — the order-completed handler and a legacy webhook from the payment processor — were both firing on success. The constraint stopped the email but it did not stop the second code path, which was also incrementing loyalty points, calling the warehouse, and emitting a duplicate analytics event.
What the relapse cost: By 2021 a loyalty-points audit found €96,000 in over-credited points across approximately 27,000 customers, plus a permanent 0.4% inflation in our reported conversion metrics that had been quietly impressing executives for almost eight years.
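
A sketch of deduplicating at the trigger rather than at one of its side effects. The handler class, the in-memory set standing in for a durable unique-key store, and the flat 10 points per order are hypothetical. The point is that the uniqueness check sits in front of every side effect, so a second trigger for the same order does nothing at all, and is still recorded.

```python
class OrderCompletionHandler:
    """Runs side effects at most once per order, whichever path fires first."""

    def __init__(self):
        self._seen = set()        # stand-in for a durable unique-key store
        self.triggers = []        # every trigger, kept as evidence
        self.emails_sent = []
        self.points = {}

    def handle(self, order_id, customer_id, source):
        self.triggers.append((order_id, source))
        if order_id in self._seen:
            return False          # duplicate trigger: no side effects
        self._seen.add(order_id)
        self.emails_sent.append(order_id)
        self.points[customer_id] = self.points.get(customer_id, 0) + 10
        return True


handler = OrderCompletionHandler()
first = handler.handle(1001, "cust-7", source="order-completed")
second = handler.handle(1001, "cust-7", source="payment-webhook")
```

The triggers list matters as much as the guard: a count of suppressed duplicates per source is the graph that would have pointed at the legacy webhook.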

Case G · filed 2012-10-30 · returned 2023-12-18

The timestamp that lived in Copenhagen.

Original symptom: Scheduled deliveries occasionally showed up an hour early or late around the last Sunday of October and March.
What we changed: Added a one-hour offset adjustment when the server's localtime() reported DST in transition. Deployed Halloween 2012.
Why it looked fixed: The next DST event passed without complaint. The next one after that, also fine. For eleven years and two months, this was widely considered "the DST fix" and was cited by name in the on-call runbook.
What we should have seen: The application server was in Copenhagen. So were the drivers. So were the customers. When we opened a Lisbon warehouse in mid-2023 the offset adjustment, which had been silently assuming Europe/Copenhagen, started applying to Iberian deliveries with confidence and incorrectness.
What the relapse cost: Two weeks of Iberian deliveries scheduled an hour off during the autumn transition. ~3,800 missed delivery windows, €52k in compensation vouchers, a hard rewrite of the scheduling layer to use UTC everywhere, and a quiet promise to never again hardcode an assumption that wasn't even visible in the code.
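
A minimal sketch of "UTC everywhere" using Python's standard zoneinfo module (Python 3.9+, with time zone data available on the host). Storing the zone name per warehouse is an assumption, and local_delivery_time is a hypothetical helper. The instant is stored in UTC and converted only for display, with the zone passed in as a named argument rather than inherited from the server.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_delivery_time(utc_instant, warehouse_tz):
    """Render a UTC-stored instant in the warehouse's own time zone."""
    if utc_instant.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return utc_instant.astimezone(ZoneInfo(warehouse_tz))


slot = datetime(2023, 11, 6, 9, 0, tzinfo=timezone.utc)
copenhagen = local_delivery_time(slot, "Europe/Copenhagen")  # 10:00 local
lisbon = local_delivery_time(slot, "Europe/Lisbon")          # 09:00 local
```

Rejecting naive datetimes at the boundary is the part that makes the assumption visible: a caller can no longer pass "whatever the server thinks the time is" without saying so.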

Case H · filed 2019-04-18 · returned 2023-08-02

The OOM that wasn't memory at all.

Original symptom: The report-generation service was being OOM-killed every 90 minutes under load. Heap dumps showed large in-memory result sets.
What we changed: Doubled the container memory from 2 GB to 4 GB. The OOM stopped. Cost increase per pod: about €14/month. Acceptable.
Why it looked fixed: Restart frequency went from every 90 minutes to roughly never. A second engineer suggested streaming the reports instead, was thanked, and the suggestion was filed in a tech-debt board nobody groomed.
What we should have seen: Memory use was not high because the reports were large; it was high because the report code path synchronously built a 1.2-million-row dataframe to render a 200-row CSV. The 4 GB ceiling deferred the moment when "synchronous build" met "real customer volume" by a little over four years.
What the relapse cost: Summer 2023, the largest customer requested year-over-year reports. The 4 GB containers OOM'd. The 8 GB containers OOM'd. The 16 GB containers OOM'd, slower. Sixteen days of engineering to rewrite the path to stream, plus an apology call from a director.
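
A sketch of the streaming shape, assuming the report is a per-customer total. The row source, column names and aggregation are hypothetical. What it illustrates is that memory scales with the size of the output (a few hundred totals) rather than the size of the input (every source row), because rows are consumed one at a time from a generator.

```python
import csv
import io


def fake_rows(n):
    """Hypothetical row source; yields one row at a time instead of a list."""
    for i in range(n):
        yield {"customer": i % 200, "amount": 1}


def write_summary_csv(rows, out):
    """Aggregate while iterating so memory scales with output, not input."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    writer = csv.writer(out)
    writer.writerow(["customer", "total"])
    for customer in sorted(totals):
        writer.writerow([customer, totals[customer]])
    return len(totals)


buffer = io.StringIO()
row_count = write_summary_csv(fake_rows(12_000), buffer)
```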

§ ❦ §

§03 · Tally

The shape of returning bugs.

Eight rows, ordered by dormancy. The "role" column describes how the original fix functioned, not whether it was a good idea. Almost all of them were defensible.

Case	Title	Filed	Returned	Dormancy	Role of original fix	Cost at return
G	The Copenhagen timestamp	2012-10-30	2023-12-18	11y 2m	PATCH: hardcoded local assumption	€52,000
A	Silenced reconciliation crash	2014-11-09	2023-04-02	8y 5m	MASK: swallowed exception	€384,000
F	Duplicate-email constraint	2013-07-01	2021-05-08	7y 10m	WALL: caught symptom, not source	€96,000
E	Nullable-to-"default" coercion	2018-03-12	2024-09-30	6y 6m	PATCH: erased semantic state	€41,000
D	Muted concurrency test	2015-09-04	2020-11-23	5y 2m	MASK: muted real race	€420,000
C	6-hour pricing cache	2017-06-30	2022-01-14	4y 6m	WALL: bought latency, lost freshness	€73,400
H	OOM, doubled memory	2019-04-18	2023-08-02	4y 4m	PATCH: deferred algorithmic cost	€36,000
B	Retry around deadlock	2016-02-22	2019-08-17	3y 6m	MASK: redrew the noise floor	€188,000

Roles: MASK — the fix hid evidence of the bug. PATCH — the fix accepted a worse contract to ship. WALL — the fix isolated downstream but left the upstream intact.

"We didn't fix the deadlock. We fixed the way we looked at the deadlock. For thirty-one months that worked, which is the most dangerous duration in software."

— engineer J.S., 2019 postmortem, redacted

§04 · Anatomy

How a relapse
tends to unfold.

Across the eight cases, the same five phases show up with eerie consistency. We mark each with the observation we should have read at the time, and didn't.

I.

Day 0 · the original symptom

An honest, narrow alert.

A real error fires. Stack trace, timestamp, ticket. The signal is precise: a column missing, a deadlock, a memory ceiling. The team gathers and the conversation is healthy, because the bug is small and the room is calm.

Observation we missed: The phrasing of the alert is too specific to be the whole story. "KeyError on sku_alt" is a fact about a column. It is not yet a fact about why that column went missing or what happens downstream if it does.

II.

Day 0 to Day 7 · the cheap fix

A patch that ships before lunch.

Someone wraps, retries, caches, mutes, raises a limit, or coerces a value. Code review is fast because the diff is small. Tests pass because the test only knew about the narrow symptom. The fix earns a kind round of approving emoji.

Observation we missed: The diff has no new test that captures the underlying invariant; it only re-greens the existing one. If the patch lives on its own without a structural test, it is almost always a mask.

III.

Day 7 to Year 1 · the quiet

The dashboard agrees.

Charts go flat in the good direction. The original alert never fires again. The team's confidence in the fix grows by the day, because the rate of disconfirming evidence is, by construction, zero: the fix has removed the channel through which the bug used to speak.

Observation we missed: A graph going to exactly zero is almost never the natural shape of a real-world phenomenon. Zero often means we silenced the sensor, not the source.

IV.

Year 1 to Year N · drift

The environment moves.

Volume grows. A new region opens. A campaign shifts traffic into a once-rare code path. A schema gains a column. The fix is still defensible against its original conditions, but the conditions have left it behind. Nobody re-reads the fix because the ticket is closed.

Observation we missed: The closed ticket is a contract with a specific world. The world breaks contracts. Nobody owns "re-read the closed tickets when the world changes," and that is the gap relapses live in.

V.

Day X · return

Same wound, louder, older.

The bug comes back, now compounded by years of downstream decisions that took the broken state for granted. Loyalty points have been counted. Prices have been quoted. Drivers have been routed. The cost of the relapse is not the cost of the original bug — it is the cost of everything that built on top.

Observation we missed: Compound cost grows with dormancy, but not smoothly. A bug that costs €1 to fix on day zero rarely costs €100 a year later; it costs €1, then €1, then €4,000, then €74,000, then a customer.

§05 · Workbench

Three tools we
now lean on.

None of these are new. All of them were available in 2014. We are not proud that we needed eight relapses to start using them, but we are using them.

01

Invariant tests, not regression tests.

Every fix must include at least one test that names the invariant the bug violated, not just a test that re-runs the failing input. If the only test is "the original ticket no longer reproduces," the fix is suspect.

Adopted Q3 2023. Caught 11 weak fixes in the first six months, including a near-repeat of Case C.

Owner: platform team · cost ≈ 8 eng-hr / week
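
The difference between the two kinds of test is easiest to see side by side. The sketch below uses a toy reconciliation function in the spirit of Case A; the function and the chosen invariant (every inbound unit is accounted for somewhere) are hypothetical illustrations, not the platform team's actual rule set.

```python
def reconcile_toy(rows):
    """Toy reconciliation under test: rows without sku_alt are set aside."""
    accepted = [r for r in rows if "sku_alt" in r]
    set_aside = [r for r in rows if "sku_alt" not in r]
    return accepted, set_aside


def test_regression_only():
    # Re-runs the failing input. It passes as long as nothing raises,
    # even if the row is silently dropped.
    reconcile_toy([{"sku": "A", "qty": 4}])


def test_invariant_units_are_conserved():
    # Names the invariant: every inbound unit is accounted for somewhere.
    rows = [{"sku": "A", "sku_alt": "A-1", "qty": 4}, {"sku": "B", "qty": 70}]
    accepted, set_aside = reconcile_toy(rows)
    total_in = sum(r["qty"] for r in rows)
    total_out = sum(r["qty"] for r in accepted + set_aside)
    assert total_in == total_out
```

A version of reconcile_toy that skipped the second row would still pass the first test and would fail the second, which is the property the rule is after.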

02

The "what would have to change for this to break again" memo.

Two paragraphs, mandatory on every postmortem, kept in the ticket. List the assumptions the fix depends on. Volume? Region? Schema shape? A coupon flag? Anything we hardcoded? Anything we deferred?

The memo from Case G would have read: "assumes Europe/Copenhagen, breaks the day we open another timezone." We opened Lisbon in 2023.

Owner: ticket author · cost ≈ 10 minutes

03

Re-reading the closed tickets, deliberately.

One Friday a month, two engineers read ten randomly-sampled closed tickets older than a year and ask, in writing: does this fix still work given the world as it is today? The answer is "no, actually" about 14% of the time.

It has, so far, prevented two relapses we can name. It has surfaced four we still need to address.

Owner: rotating · 90 minutes a month
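
The sampling step is small enough to sketch. The ticket shape (a dictionary with a closed_on date) and the function name are assumptions; the ten-ticket sample size and one-year minimum age come from the description above.

```python
import random
from datetime import date, timedelta


def sample_for_review(tickets, today, k=10, min_age_days=365, seed=None):
    """Pick up to k closed tickets older than min_age_days, uniformly at random."""
    cutoff = today - timedelta(days=min_age_days)
    eligible = [t for t in tickets if t["closed_on"] <= cutoff]
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))


# Hypothetical backlog: one ticket closed every 30 days going back from today.
today = date(2026, 6, 5)
backlog = [
    {"id": i, "closed_on": today - timedelta(days=30 * i)} for i in range(60)
]
picked = sample_for_review(backlog, today, seed=14)
```

Sampling uniformly, rather than picking the tickets people remember, is deliberate: the fixes that come back are usually the ones nobody thinks about.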

§06 · Smells

Root-cause smells:
a field handbook.

Seven heuristics that, more often than not, tell us when the fix we are about to write is headed for a future edition of this catalogue. They are blunt by design.

01 · The fix is a verb that means "ignore."

Try/except without a re-raise, flaky(reruns=), continue, pass, // nolint, an empty catch block, a muted alert, a default value substituted at the boundary. If your diff makes the system not see the bug, the bug is still there.
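
The first item on that list can be checked mechanically. The sketch below uses Python's standard ast module to flag except blocks that contain no raise statement anywhere inside them; the function name is hypothetical, and the check is a blunt heuristic (it will flag handlers that legitimately recover) rather than a lint rule we claim to run.

```python
import ast


def swallowing_handlers(source):
    """Return line numbers of except blocks that contain no raise statement."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            has_raise = any(isinstance(n, ast.Raise) for n in ast.walk(node))
            if not has_raise:
                flagged.append(node.lineno)
    return sorted(flagged)


SAMPLE = '''
try:
    row["sku_alt"]
except KeyError:
    pass
try:
    commit()
except RuntimeError:
    raise
'''

flagged_lines = swallowing_handlers(SAMPLE)  # only the KeyError handler
```

The output is a list of places to ask a question, not a list of bugs. A handler that swallows on purpose should be able to say what it is swallowing and who reads the log.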

02 · The graph goes to exactly zero.

Real phenomena have a noise floor. When a previously bursty metric flatlines to 0 after a fix, treat that as suspicious until you can explain the noise source you removed. Often, you removed the sensor.

03 · The fix is downstream of the cause.

If the bug is in service A's data and you fixed it by changing service B's reader, you've built a wall. The next consumer of service A's data will hit the same wound from a different angle. Cases C, E, F.

04 · The fix encodes a constant nobody named.

Magic numbers, magic strings, magic timezones, magic timeouts. If a future engineer can't tell from the code why the constant is what it is, they cannot tell when the constant is no longer right. Case G slept for 11 years on a hardcoded local time assumption.

05 · The reproducer is "intermittent."

The word "intermittent" in a ticket title is, in our data, the single strongest predictor of a relapse. Of the 312 tickets we re-opened more than once, 71% had the word "intermittent," "flaky," or "occasionally" in the original title.

06 · The fix changes a limit, not a shape.

Doubling memory, doubling timeout, doubling retries, raising a quota, raising a connection pool. These move the cliff, they don't remove it. Acceptable as triage; not acceptable as the closing change. Case H.

07 · The fix arrives without a new test that fails first.

This is the simplest one and the one we break most often. If you cannot write a test that fails on the unpatched code and passes on the patched code, you do not know what you fixed. You only know what stopped complaining.

+1 · You feel relieved before you feel curious.

Not really a rule, but we kept noticing it. The pattern across all eight cases is the same: relief that the alert is quiet, followed by no further questions. Curiosity is the cheapest debugging tool we own. We forget to use it most when we need it most.

§07 · Letters & errata

A quiet apology,
and some answers.

Closing remarks, three letters we received during the drafting of this issue, and the questions readers ask us most.

If you worked on Halberd between 2012 and 2020 and you recognize yourself in one of these eight cases, please know that we are not writing about you. We are writing about us. Every one of these fixes was the right answer to a specific question at a specific moment, and the engineers who shipped them were doing the job they had been asked to do, on a deadline that was real, with information that was incomplete. We owe most of them either coffee or a beer, and we still believe most of them shipped the best change available to them at the time.

What we got wrong was structural. We treated closed tickets as solved problems and let the documentation of why they closed evaporate with the Slack thread. We treated green dashboards as evidence and let our own confidence build a moat around the kind of doubt that finds returning bugs. We treated relief as a feeling, when it is in fact a smell.

The point of this catalogue is not to flagellate. It is to convince the next on-call engineer — who will be paged at 03:11 on a Sunday and offered a small, defensible fix that wraps a real signal in a polite catch block — to pause for ninety seconds and ask a different question: what did this bug just tell me, and am I about to make it stop telling me without changing what it knew?

— Margit V., editor, June 2026

Q.1 Are you sure these eight are representative of the codebase as a whole?

No. We selected the eight clearest narratives from a pool of 312 reopened tickets. There is selection bias toward cases with neat causal stories. A more honest portrait would include the 40 or so cases where the relapse cause is genuinely ambiguous; some are in the next issue's draft.

Q.2 Doesn't every fix mask the deeper problem in some sense?

Yes, in the same sense that every map omits some terrain. The honest version of this question is: are you masking on purpose, with documentation of what you've masked, or are you masking by accident? Cases A and D are the second kind. They are the ones that come back.

Q.3 How do you justify the cost of the Friday ticket-review ritual?

Two engineers, 90 minutes, once a month. That is 36 engineer-hours per year, roughly €4,000 fully loaded. The two relapses we know it has prevented were each at least €60k in our recent history. The ROI math is generous to a fault. The harder cost is cultural: it forces senior engineers to sit with old code that is not glamorous.

Q.4 Why do so many of these involve concurrency or distributed-systems edge cases?

Because the noise floor is higher there. Concurrency bugs are uniquely vulnerable to being papered over, because their reproducer is probabilistic, and a fix that reduces the probability can look indistinguishable from a fix that eliminates the cause. See Cases B and D for the canonical shapes.

Q.5 Did you consider hiring or tooling that could have caught these?

We considered, in chronological order: Sentry (adopted 2017, would not have caught A), Honeycomb (2019, would have caught H sooner), a contracted SRE practice (2021, declined), and Polar-grade chaos testing (2023, in progress). The most effective intervention has been the smallest one: the invariant test rule in §05. Tooling matters less than the question you ask when you write the test.

Q.6 What's the worst possible misread of this issue?

"Never apply quick fixes." That would be wrong, expensive, and out of touch with how real software is shipped. Quick fixes are necessary, frequently correct, and almost always the right first response. The issue is what happens in the hour, day, and week after — specifically, whether the fix earns a follow-up that names the real cause and re-tests the invariant. Most don't.

Q.7 Will there be an Issue 15?

Yes, in October. The working title is "Bugs we caused while writing tests." We have, depressingly, more material than we expected.
