Data EngineeringAugust 18, 2025

Stop Building Pipelines Like Scripts

Your pipeline isn't a script. Treat it like infrastructure.

PythonPipelinesArchitecture

Stop Building Pipelines Like Scripts

There's a pattern I keep running into. Someone writes a pipeline as a Python script. It pulls data from point A, transforms it, drops it into point B. It works. They move on.

Six months later, that script is running in production and nobody can tell you what happens when it fails halfway through. Does it retry? Does it pick up where it left off? Does it silently duplicate rows? Nobody knows, because it was never built to answer those questions.

The problem isn't the code. The problem is the mental model. A script runs once and finishes. A pipeline runs continuously and has to survive the real world.

What breaks in production

The list is always the same:

The source API returns a 500 and the script crashes with no retry
A partial batch writes to the destination, then the script fails, and the next run writes the full batch again
Someone changes a column upstream and the transform silently drops it
The script takes longer than expected and the next scheduled run starts before the first one finishes
Jerry changes the global variable DO_NOT_TOUCH and now column product.description is in pig latin...

None of these are edge cases. They're Tuesday. If your pipeline doesn't handle them, it's not a pipeline. It's a demo.

Idempotency is the whole game

The single most important property a pipeline can have is idempotency. If you run it twice with the same input, you get the same output. No duplicates. No side effects.

This means:

Use merge/upsert logic instead of blind inserts
Track what you've already processed (watermarks, checkpoints, dedup keys)
Design transforms so they're stateless when possible

def upsert_batch(db, records, conflict_key="id"):
    for record in records:
        db.execute(
            f"""
            insert into target ({', '.join(record.keys())})
            values ({', '.join(['?'] * len(record))})
            on conflict({conflict_key}) do update set
            {', '.join(f'{k} = excluded.{k}' for k in record if k != conflict_key)}
            """,
            list(record.values())
        )

This isn't fancy. It's just correct. And it means you can re-run any batch without worrying about what state the destination is in.

Error handling isn't optional

A pipeline that crashes on the first error is a liability. You need to decide, for every failure mode, what the right behavior is:

Transient errors (network timeouts, rate limits): retry with backoff
Bad records (schema mismatch, null where not expected): dead-letter queue, log it, keep going
Fatal errors (auth failure, destination unreachable): fail loudly, alert, stop

The worst thing a pipeline can do is fail silently. The second worst thing is fail on every bad record when only one out of ten thousand is actually broken.

Treat it like infrastructure

The shift that matters is thinking about your pipeline the way you'd think about a service. It needs:

Health checks (is it running? is it current?)
Observability (how long did the last run take? how many records?)
Graceful degradation (can it skip a bad batch and keep going?)
Clear ownership (who gets paged when it breaks?)

When you build pipelines like scripts, you get scripts in production. When you build them like infrastructure, you get systems you can actually trust while you sleep.

Back to all posts