Why I stopped trusting retry loops in production systems.
There's something comforting about a retry loop.
Wrap a block of code in try/except, sleep a little, try again. Feels responsible. Feels resilient.
It isn't.
Retries are often a smell.
Not always. But often.
They're a signal that something deeper isn't designed correctly - and instead of fixing the underlying structure, we're asking the system to "just try harder."
That's not resilience. That's optimism.
And optimism is not a strategy in production.
A retry makes you feel safe because it handles transient failure.
Network glitch? Retry.
Database timeout? Retry.
Temporary lock? Retry.
Sure.
But here's what most retry logic actually does: it repeats the same action and hopes for a different result.
The moment you rely on retries as your primary failure strategy, you've admitted the system isn't deterministic.
You're not solving the problem. You're gambling that the next attempt will be luckier.
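Here's a minimal sketch of the kind of loop I mean. The name naive_retry and its parameters are made up for illustration, not taken from any library.

```python
import time


def naive_retry(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation until it succeeds or the attempts run out.

    Assumes attempts >= 1. The sleep parameter is injectable only so the
    sketch can be exercised without actually waiting.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:
            # Everything lands here: a network blip, a permanent failure, a bug.
            # The loop cannot tell them apart, so it treats them all the same.
            last_error = error
            sleep(delay_seconds)
    raise last_error
```

Notice what's missing: the loop never asks whether the previous attempt partly succeeded, or whether the error is one that another attempt could fix.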
If your operation isn't idempotent, retries are dangerous.
If your state transitions aren't atomic, retries are dangerous.
If your system doesn't know what "correct" looks like, retries are dangerous.
Because now you're not just retrying a failed action.
You're potentially duplicating side effects.
Sending two emails. Charging twice. Uploading the same file twice. Marking something complete that never finished.
Retries multiply uncertainty unless the underlying system is built around invariants.
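One common way to build that invariant is an idempotency key: the caller sends the same key on every attempt, and the receiver applies the side effect at most once per key. The sketch below is an in-memory stand-in, assuming a hypothetical PaymentLedger class and a made-up key format. A real system would keep the key-to-result mapping in durable storage.

```python
class PaymentLedger:
    """Hypothetical in-memory stand-in for a service that honors idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> result recorded for that key
        self.charges = []   # every charge that was actually applied

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


ledger = PaymentLedger()
first = ledger.charge("order-1001", 2500)
retry = ledger.charge("order-1001", 2500)  # e.g. the first response was lost
```

The retry gets back the same result as the first call, and ledger.charges holds a single entry. The second attempt changed nothing.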
In production systems - especially long-running background workers - what matters isn't how often you retry.
What matters is this:
Every operation is safe to re-run.
Completion is explicit.
Truth is durable.
If those are true, retries become boring.
If those are false, retries become dangerous.
The goal isn't "keep trying."
The goal is "make it safe to try again."
Retries amplify failure.
If a dependency is partially down and you have many workers retrying aggressively, you can turn a hiccup into a self-inflicted outage.
Your system starts fighting the thing it depends on.
A small timeout turns into a cascade. Logs explode. CPU spikes. Operators panic.
All because "retry on failure" was treated as a safety blanket instead of a last resort.
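If you do retry against a struggling dependency, the usual way to avoid piling on is to bound the attempts, back off exponentially, and add random jitter so workers don't retry in lockstep. This is a sketch under assumptions: TransientError is a hypothetical exception class that your own code would raise for failures worth retrying, and the default numbers are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that another attempt might fix."""


def call_with_backoff(operation, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Retry only TransientError, a bounded number of times, with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                # Budget spent: surface the failure instead of hammering the dependency.
                raise
            # Random delay between 0 and an exponentially growing, capped ceiling.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Any other exception passes straight through on the first attempt. That's deliberate: a bug or a permanent failure gains nothing from being repeated.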
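Retries are correct when:
The failure is transient.
The operation is idempotent.
The state transition is atomic.
The system knows what "correct" looks like.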
Retries should exist inside a design that is already safe.
They should not be compensating for one that isn't.
Stop asking, "How many times should we retry?"
Start asking, "Why can this fail in a way that leaves the system confused?"
If every operation is safe to re-run, completion is explicit, and truth is durable, retries become boring implementation details.
Boring systems survive.
If you can't answer that question clearly, the retry isn't your fix.
It's your warning light.