Stop Building Pipelines Like Scripts
Your pipeline isn't a script. Treat it like infrastructure.
Stop Building Pipelines Like Scripts
There's a pattern I keep running into. Someone writes a pipeline as a Python script. It pulls data from point A, transforms it, drops it into point B. It works. They move on.
Six months later, that script is running in production and nobody can tell you what happens when it fails halfway through. Does it retry? Does it pick up where it left off? Does it silently duplicate rows? Nobody knows, because it was never built to answer those questions.
The problem isn't the code. The problem is the mental model. A script runs once and finishes. A pipeline runs continuously and has to survive the real world.
What breaks in production
The list is always the same:
- The source API returns a 500 and the script crashes with no retry
- A partial batch writes to the destination, then the script fails, and the next run writes the full batch again
- Someone changes a column upstream and the transform silently drops it
- The script takes longer than expected and the next scheduled run starts before the first one finishes
- Jerry changes the global variable
DO_NOT_TOUCHand now column product.description is in pig latin...
Idempotency is the whole game
The single most important property a pipeline can have is idempotency. If you run it twice with the same input, you get the same output. No duplicates. No side effects.
This means:
- Use merge/upsert logic instead of blind inserts
- Track what you've already processed (watermarks, checkpoints, dedup keys)
- Design transforms so they're stateless when possible
def upsert_batch(db, records, conflict_key="id"):
for record in records:
db.execute(
f"""
insert into target ({', '.join(record.keys())})
values ({', '.join(['?'] * len(record))})
on conflict({conflict_key}) do update set
{', '.join(f'{k} = excluded.{k}' for k in record if k != conflict_key)}
""",
list(record.values())
)
This isn't fancy. It's just correct. And it means you can re-run any batch without worrying about what state the destination is in.
Error handling isn't optional
A pipeline that crashes on the first error is a liability. You need to decide, for every failure mode, what the right behavior is:
- Transient errors (network timeouts, rate limits): retry with backoff
- Bad records (schema mismatch, null where not expected): dead-letter queue, log it, keep going
- Fatal errors (auth failure, destination unreachable): fail loudly, alert, stop
Treat it like infrastructure
The shift that matters is thinking about your pipeline the way you'd think about a service. It needs:
- Health checks (is it running? is it current?)
- Observability (how long did the last run take? how many records?)
- Graceful degradation (can it skip a bad batch and keep going?)
- Clear ownership (who gets paged when it breaks?)