Data Engineering · December 19, 2025

IaC for Data

Your Pipeline Isn't Code Until It's in Git

Database · Architecture · System Design · Data Systems · Data

The Fragmentation Problem

At some point, every data platform reaches the same moment of reckoning.

You're onboarding a new engineer. You sit down to walk them through the stack. And somewhere between "the RDS schemas were set up manually," "dbt is configured in this YAML here but also this other YAML over here," "Airflow runs on that VM that Todd set up in 2021," and "the S3 buckets just... exist"... you realize the truth: you don't have a data platform. You have a collection of tools that happened to survive.

This is the fragmentation problem. And it's not a tooling problem... it's a philosophy problem.

Every layer of the modern data stack accumulates its own form of drift. Schemas mutate in place. Transformation logic lives in a framework that owns the execution model. Pipelines are defined in orchestration DSLs that blur the line between configuration and code. Access policies are applied ad-hoc and never written down. The result is a platform that only the person who built it can navigate... and even they're not sure anymore.

The industry response to this has been more tools. More abstraction. More managed services. More config formats. But the problem was never that we lacked tools. The problem is that we stopped treating our data infrastructure like software.

Infrastructure as Code is the correction. But most data teams apply it narrowly... maybe they Terraform the cloud resources and call it done. The deeper idea is more radical: every layer of your data stack should be declared, version-controlled, and reproducible. Storage. Transformation. Orchestration. Access. All of it. In Git.

IaC Is a Philosophy, Not a Tool

Let's be precise about what Infrastructure as Code actually means, because the term gets diluted fast.

IaC is not Terraform. Terraform is an implementation. IaC is a mental model:

Desired state is declared. That declaration is version-controlled. The system converges to it.

That's the whole idea. And once you internalize it, you realize it applies far beyond cloud resources. It applies to database schemas. It applies to transformation logic. It applies to how pipelines are deployed. It applies to who has access to what.

The discipline it enforces is what matters:

  • Declared: the state of the system is written down somewhere explicit, not held in someone's head or reconstructed from click history
  • Version-controlled: changes are reviewable, reversible, and attributable
  • Reproducible: given the declarations, you can rebuild the system from scratch

When you hold your data stack against these three tests, most of it fails immediately. The RDS schema was built by hand. The transformation logic is locked inside a framework with its own runtime. The pipeline schedules live in a UI. The IAM grants were applied once and never documented.

None of that is code. It's just state... fragile, undocumented, and slowly drifting from whatever you think it is.

The goal isn't to adopt new tools. The goal is to collapse tool surface area until every layer of your stack is something Git can reason about. A .tf file. A Python module. A Dockerfile. A SQL migration file. Things with diffs. Things with history. Things you can review in a PR.

The Four Layers... Done Right

Layer 1: Storage → Terraform

The storage layer is where IaC practices are most mature, and for good reason... cloud infrastructure tooling is excellent. Terraform and Pulumi give you declarative, plan-before-apply control over the resources that underpin everything else.

But data teams often stop at VPCs and S3 buckets. The same discipline should extend to your database schema, your RDS instances, and your storage topology.

# rds.tf

resource "aws_db_instance" "analytics" {
  identifier        = "analytics-db"
  engine            = "mysql"
  engine_version    = "8.0"
  instance_class    = "db.t3.medium"
  db_name           = "analytics"
  username          = var.db_username
  password          = var.db_password
  allocated_storage = 50
  storage_type      = "gp3"

  # Place the instance in the private subnets declared below.
  db_subnet_group_name = aws_db_subnet_group.analytics.name

  # Required for the rds-db:connect IAM permission used in the access layer.
  iam_database_authentication_enabled = true

  backup_retention_period   = 7
  skip_final_snapshot       = false
  final_snapshot_identifier = "analytics-db-final"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_db_subnet_group" "analytics" {
  name       = "analytics-subnet-group"
  subnet_ids = var.private_subnet_ids

  tags = {
    ManagedBy = "terraform"
  }
}

This isn't just documentation. This is the system state. When a new environment needs to come up... staging, a feature branch environment, disaster recovery... it comes from this file, not from a series of manual steps someone wrote in a Confluence doc three years ago.

The terraform plan output is your diff. The PR is your review. The merge is your audit trail.

Schema migrations deserve the same treatment. Raw SQL migration files, numbered sequentially, applied in order, tracked in a migrations table. No magic. No framework. Just files in a directory and a script that runs them.

-- migrations/0012_add_customer_segment.sql

ALTER TABLE marts.customers
ADD COLUMN segment VARCHAR(50) DEFAULT 'standard';

UPDATE marts.customers
SET segment = 'enterprise'
WHERE annual_revenue > 1000000;

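# scripts/migrate.py

import mysql.connector  # provides the connection object passed in as `conn`
from pathlib import Path

def run_migrations(conn, migrations_dir: str):
    cursor = conn.cursor()

    # Bookkeeping table: one row per migration file that has been applied.
    cursor.execute("""
        create table if not exists _migrations (
            id int auto_increment primary key,
            filename varchar(255) not null unique,
            applied_at timestamp default current_timestamp
        )
    """)

    # execute() does not return the rows; fetch them explicitly.
    cursor.execute("select filename from _migrations")
    applied = {row[0] for row in cursor.fetchall()}
    pending = sorted(Path(migrations_dir).glob("*.sql"))

    for migration in pending:
        if migration.name in applied:
            continue
        print(f"Applying {migration.name}...")
        # Run one statement at a time. This naive split assumes no ';' appears
        # inside string literals, comments, or stored-procedure bodies.
        for statement in migration.read_text().split(";"):
            if statement.strip():
                cursor.execute(statement)
        cursor.execute("insert into _migrations (filename) values (%s)", (migration.name,))
        # MySQL commits DDL implicitly, so this commit covers the data changes
        # and the bookkeeping row, not the ALTER statements themselves.
        conn.commit()
        print(f"[SUCCESS] {migration.name}")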

That's it. No ORM. No migration framework. SQL files that are readable, reviewable, and reproducible.
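The "plan" half of that script can be pulled out and run in CI without touching a database. The sketch below is one way to do it, assuming a hypothetical `plan_migrations` helper and a four-digit filename convention inferred from `0012_add_customer_segment.sql`; neither is part of the script above.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming rule: four digits, underscore, snake_case description.
MIGRATION_NAME = re.compile(r"^(\d{4})_[a-z0-9_]+\.sql$")


def plan_migrations(migrations_dir: Path, applied: set[str]) -> list[str]:
    """Return pending migration filenames in apply order, without running them."""
    names = sorted(p.name for p in migrations_dir.glob("*.sql"))

    seen: dict[str, str] = {}
    for name in names:
        match = MIGRATION_NAME.match(name)
        if match is None:
            raise ValueError(f"Unexpected migration filename: {name}")
        number = match.group(1)
        if number in seen:
            # Two branches both claimed the same number; fail before deploy.
            raise ValueError(f"Duplicate migration number {number}: {seen[number]}, {name}")
        seen[number] = name

    return [name for name in names if name not in applied]


if __name__ == "__main__":
    # Demo against a throwaway directory instead of a real repository.
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        (directory / "0001_init.sql").write_text("select 1;")
        (directory / "0002_add_customer_segment.sql").write_text("select 1;")
        print(plan_migrations(directory, applied={"0001_init.sql"}))
```

Running a check like this on every PR would catch numbering collisions at review time rather than at deploy time.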

Layer 2: Transformation → Pure Python

This is where most data teams make the wrong turn.

The instinct is to reach for a transformation framework... dbt, SQLMesh, something that gives you a project structure and a run command. And frameworks aren't inherently wrong. A well-structured dbt project in Git, with version-controlled models and tested macros, is doing IaC in spirit. The problem isn't the framework itself... it's when the framework becomes load-bearing infrastructure that can't be reasoned about outside its own runtime.

That's the line to watch. When your lineage graph only exists inside the framework's metadata store. When your tests are defined in a YAML schema that only the framework's CLI can evaluate. When everything is mediated by a single tool... you've crossed from "using a framework" into "being owned by a framework." You've traded legibility for convenience, and the bill comes due when the framework's opinions diverge from yours, or when you need to debug something at 2 AM without the framework running. Ugh.

The question to ask isn't "should I use dbt?" It's: can I reason about this transformation layer using only Git, a text editor, and standard tooling? If the answer is no... if understanding the system requires running the framework... then the framework has become a dependency of your platform in a way that undermines the IaC principle.

The alternative is simpler than it sounds: write Python...

Not a Python wrapper around SQL. Not a Pythonic DSL that generates SQL. Just Python... functions and classes that transform data, with inputs and outputs, tested like any other software.

# transforms/customers.py

import polars as pl
from typing import Protocol

class CustomerSource(Protocol):
    def read(self) -> pl.DataFrame: ...

def enrich_customers(source: CustomerSource) -> pl.DataFrame:
    """
    Enrich raw customer records with derived fields.
    """
    df = source.read()

    return (
        df
        .filter(pl.col("deleted_at").is_null())
        .with_columns([
            pl.when(pl.col("annual_revenue") > 1_000_000)
            .then(pl.lit("enterprise"))
            .when(pl.col("annual_revenue") > 100_000)
            .then(pl.lit("mid-market"))
            .otherwise(pl.lit("standard"))
            .alias("segment"),

            (pl.col("created_at").dt.year() == pl.lit(2026)).alias("is_new_this_year")
        ]).select([
            "customer_id",
            "name",
            "email",
            "segment",
            "annual_revenue",
            "is_new_this_year",
            "created_at",
        ])
    )

# tests/test_customers.py

from datetime import date

import polars as pl
from transforms.customers import enrich_customers

def test_enterprise_segmentation():
    class MockSource:
        def read(self):
            return pl.DataFrame({
                "customer_id": [1, 2, 3],
                "name": ["Acme", "Beta", "Gamma"],
                "email": ["a@acme.com", "b@beta.com", "c@gamma.com"],
                "annual_revenue": [2_000_000, 500_000, 50_000],
                # Real dates, not strings: the transform calls .dt.year() on this column.
                "created_at": [date(2026, 1, 15), date(2025, 6, 1), date(2024, 3, 22)],
                "deleted_at": [None, None, None],
            })

    result = enrich_customers(MockSource())

    assert result.filter(pl.col("customer_id") == 1)["segment"][0] == "enterprise"
    assert result.filter(pl.col("customer_id") == 2)["segment"][0] == "mid-market"
    assert result.filter(pl.col("customer_id") == 3)["segment"][0] == "standard"

def test_deleted_customers_excluded():
    class MockSource:
        def read(self):
            return pl.DataFrame({
                "customer_id": [1, 2],
                "name": ["Active", "Deleted"],
                "email": ["a@co.com", "d@co.com"],
                "annual_revenue": [100_000, 200_000],
                "created_at": [date(2026, 1, 1), date(2025, 1, 1)],
                "deleted_at": [None, date(2025, 6, 1)],
            })

    result = enrich_customers(MockSource())
    assert len(result) == 1
    assert result["customer_id"][0] == 1

This is just software. You test it with pytest. You lint it with ruff. You type-check it with mypy. It goes in Git like everything else. No framework required. No DSL to learn. No abstraction layer between you and your logic.

The transformation layer becomes a Python package. A collection of pure functions that take data in and return data out. Composable. Testable. Portable. The dependency is Python... which you already have.
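"Composable" is worth making concrete. The sketch below uses plain lists of dicts as a stand-in for DataFrames so it runs with the standard library alone; `compose`, `drop_deleted`, and `add_segment` are illustrative names, not part of the package above. The same shape carries over to functions that take and return a `pl.DataFrame`.

```python
from functools import reduce
from typing import Any, Callable

Row = dict[str, Any]
Transform = Callable[[list[Row]], list[Row]]


def drop_deleted(rows: list[Row]) -> list[Row]:
    """Keep only rows that have not been soft-deleted."""
    return [row for row in rows if row.get("deleted_at") is None]


def add_segment(rows: list[Row]) -> list[Row]:
    """Attach a revenue segment without mutating the input rows."""
    def segment(revenue: float) -> str:
        if revenue > 1_000_000:
            return "enterprise"
        if revenue > 100_000:
            return "mid-market"
        return "standard"

    return [{**row, "segment": segment(row["annual_revenue"])} for row in rows]


def compose(*steps: Transform) -> Transform:
    """Chain transforms left to right into a single transform."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)


enrich = compose(drop_deleted, add_segment)

if __name__ == "__main__":
    raw = [
        {"customer_id": 1, "annual_revenue": 2_000_000, "deleted_at": None},
        {"customer_id": 2, "annual_revenue": 500_000, "deleted_at": "2025-06-01"},
    ]
    print(enrich(raw))
```

Because each step is a plain function, each one can be tested alone, and the pipeline is just the list of steps in the order they appear in the file.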

Layer 3: Orchestration → Containers on Kubernetes / ECS

Most orchestration tools sell you two things: a scheduler and a runtime. The problem is the runtime half.

When your pipeline logic lives inside Airflow, your pipeline and your scheduler are coupled. The DAG definition, the execution environment, the retry logic, the dependency graph... all of it is mediated by the orchestrator. Change the orchestrator, rewrite the pipelines. Scale the platform, scale the orchestrator. The orchestrator becomes the platform.

The IaC-native model inverts this: your pipeline is a container. The orchestrator just runs containers.

Each pipeline is an independent unit... its own image, its own dependencies, its own entrypoint. The scheduler (Kubernetes CronJob, ECS Scheduled Task, whatever) knows two things: when to run it, and what image to run. That's the entire interface.

# pipelines/customers/main.py
"""This is the container entrypoint... just a Python script."""

import logging
from transforms.customers import enrich_customers
from sources.rds import RDSSource
from sinks.rds import RDSSink

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def run():
    log.info("Starting customer enrichment pipeline")

    source = RDSSource(table="raw.customers")
    sink = RDSSink(table="marts.customers")

    df = enrich_customers(source)
    sink.write(df)

    log.info(f"Pipeline complete. {len(df)} records written.")

if __name__ == "__main__":
    run()

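# pipelines/customers/Dockerfile
#
# Build from the repository root so the pipelines/ and transforms/ packages
# (and requirements.txt) are inside the build context:
#   docker build -f pipelines/customers/Dockerfile -t customers-pipeline .

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "-m", "pipelines.customers.main"]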

# k8s/customers-cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: customers-enrichment
  namespace: data-pipelines
spec:
  schedule: "0 3 * * *"  # five cron fields: every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: customers-enrichment
              # ${IMAGE_TAG} is a placeholder. kubectl does not expand it, so
              # substitute the tag in your deploy step before applying.
              image: your-registry/customers-pipeline:${IMAGE_TAG}
              envFrom:
                - secretRef:
                    name: rds-credentials
      backoffLimit: 2

The power of this model is pipeline independence. Each pipeline has its own image. Its own dependency tree. Its own version. You can update the customers pipeline without touching the orders pipeline. You can roll back a single pipeline by pointing the CronJob at the previous image tag. You can run any pipeline locally by running its container... no orchestration platform required.

The Kubernetes manifest or ECS task definition is your declaration. It lives in Git. A PR to change a schedule or bump an image tag is the change process. The cluster is just a runtime.

This also means your local development environment and your production environment are the same thing... a container. The gap between python -m pipelines.customers.main on your laptop and the production CronJob is just the image registry.
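For laptop and cluster to really behave the same, the container should take its configuration from the environment, which is what `envFrom` with a `secretRef` provides in the manifest above. Here is a minimal sketch of that boundary. The variable names (`DB_HOST`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`, `DB_PORT`) and the `DbConfig` type are assumptions for illustration; the post does not specify what the `rds-credentials` secret contains.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DbConfig:
    host: str
    user: str
    password: str
    database: str
    port: int = 3306


def load_db_config(env: Mapping[str, str] = os.environ) -> DbConfig:
    """Build connection settings from environment variables, failing fast if any are missing."""
    required = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # A clear startup error beats a connection timeout halfway through a run.
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")

    return DbConfig(
        host=env["DB_HOST"],
        user=env["DB_USER"],
        password=env["DB_PASSWORD"],
        database=env["DB_NAME"],
        port=int(env.get("DB_PORT", "3306")),
    )
```

Passing `env` as a parameter keeps the function testable: a test hands in a plain dict, while production falls back to `os.environ`.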

Layer 4: Access / Policy → Grants as Code

This is the layer that gets skipped. And it's the layer that bites you hardest... at audit season, at an incident, or when someone wonders why the analytics intern has read access to the payments table.

Access policy is infrastructure. It should be declared, versioned, and applied the same way everything else is.

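# access/analytics_team.tf

data "aws_caller_identity" "current" {}

resource "aws_iam_role" "analyst" {
  name = "analyst-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })

  tags = {
    Team      = "analytics"
    ManagedBy = "terraform"
  }
}

resource "aws_iam_policy" "analyst_rds_read" {
  name        = "analyst-rds-read"
  description = "Describe and connect to the analytics RDS instance as a read-only database user"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "rds:DescribeDBInstances",
          "rds:DescribeDBClusters"
        ]
        Resource = aws_db_instance.analytics.arn
      },
      {
        # rds-db:connect is scoped to a database-user ARN, not the instance ARN.
        # var.aws_region and the "analyst_ro" user are placeholders; read-only
        # access itself comes from that user's SQL grants, not from IAM.
        Effect   = "Allow"
        Action   = ["rds-db:connect"]
        Resource = "arn:aws:rds-db:${var.aws_region}:${data.aws_caller_identity.current.account_id}:dbuser:${aws_db_instance.analytics.resource_id}/analyst_ro"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "analyst_rds_attach" {
  role       = aws_iam_role.analyst.name
  policy_arn = aws_iam_policy.analyst_rds_read.arn
}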

When a new team member joins, it's a PR. When someone leaves, it's a PR. When the auditor asks who had access to what and when... it's a git log.

The access layer isn't glamorous. But it's the difference between a platform you can reason about and a platform that is quietly accumulating invisible risk.
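IAM decides who can connect; what a database user can read once connected is decided by SQL grants, and those can be declared the same way. The sketch below shows one possible approach: keep the desired grants as data in the repository and compute the statements needed to reach them. The `plan_grants` helper, the `analyst_ro` user, and the table names are hypothetical, and the `'user'@'%'` host pattern is an assumption about how accounts are defined.

```python
Grant = tuple[str, str]  # (database user, "schema.table")


def plan_grants(desired: set[Grant], actual: set[Grant]) -> list[str]:
    """Return the MySQL statements that move read access from `actual` to `desired`."""
    statements: list[str] = []

    # Declared in Git but not present in the database yet.
    for user, table in sorted(desired - actual):
        statements.append(f"GRANT SELECT ON {table} TO '{user}'@'%';")

    # Present in the database but no longer declared: this is the drift.
    for user, table in sorted(actual - desired):
        statements.append(f"REVOKE SELECT ON {table} FROM '{user}'@'%';")

    return statements


if __name__ == "__main__":
    desired = {("analyst_ro", "marts.customers")}
    actual = {("analyst_ro", "payments.transactions")}
    for statement in plan_grants(desired, actual):
        print(statement)
```

The names are interpolated directly into SQL, so this only makes sense when the declarations come from a reviewed file in the repository, never from user input.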

The Unifying Pattern

Step back and look at what all four layers have in common.

Every one of them follows the same loop:

declare → plan → apply → observe
  • Declare the desired state in a file
  • Plan the delta between current state and desired state
  • Apply the change
  • Observe the result

Terraform calls it plan and apply. A migration script calls it "pending" and "applied." A container deployment calls it an image diff and a rollout. An IAM grant is a Terraform resource with a known state.
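Stripped of any particular tool, the loop is small enough to write down. The following is a toy model, not how Terraform or Kubernetes work internally: state is a dict of named resources, and `Change`, `plan`, and `apply` are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str  # "create", "update", or "delete"
    name: str


def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[Change]:
    """Compute the delta between declared state and observed state."""
    changes: list[Change] = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(Change("create", name))
        elif name not in desired:
            changes.append(Change("delete", name))
        elif desired[name] != actual[name]:
            changes.append(Change("update", name))
    return changes


def apply(changes: list[Change], desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Return the new state after executing the planned changes."""
    state = dict(actual)
    for change in changes:
        if change.action == "delete":
            state.pop(change.name)
        else:
            state[change.name] = desired[change.name]
    return state


if __name__ == "__main__":
    desired = {"analytics-db": {"storage_gb": 50}, "raw-bucket": {"versioning": True}}
    actual = {"analytics-db": {"storage_gb": 20}, "todds-vm": {}}

    changes = plan(desired, actual)        # declare -> plan
    state = apply(changes, desired, actual)  # apply
    print(changes)
    print(plan(desired, state))            # observe: an empty plan means converged
```

The observe step is just a second plan. If it comes back empty, the system matches its declaration; if it does not, something drifted.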

When every layer of your stack speaks this language, something important happens: the platform becomes legible. Any engineer... data, backend, platform, security... can look at the repository and understand the system. Not because they learned your stack, but because the stack is just code.

This is the promise of IaC applied seriously. Not just "we used Terraform for the cloud bits." The whole thing. Declared. Versioned. Reproducible.

What Done Looks Like

When you've applied this philosophy end to end, a few things become true:

A new environment is a CI run. Staging, a feature branch env, a disaster recovery clone... none of them require manual steps. terraform apply -var-file=staging.tfvars, migrations run, pipelines deployed. Done.

Schema changes are PRs. Someone wants to add a column. They write the migration file, open a PR, get it reviewed, merge it. The migration runs in CI against staging. It runs again against production on deploy. The change is in Git forever.

A pipeline update is an image tag. You change the transformation logic, cut a new image, update the CronJob manifest. The change is reviewable. If it breaks, you change the tag back.

Access is auditable. Who has access to what is a Terraform state file and a git history. No guessing. No asking around.

You can onboard an engineer in an afternoon. Not because you wrote good documentation... because the system is self-documenting. It's code. They can read it.

Stop Using Tools. Write Code!

The data engineering ecosystem has a tool for everything. A tool to define transformations. A tool to schedule pipelines. A tool to document lineage. A tool to manage schemas. A tool to govern access.

Every tool you add is a surface area to maintain, a DSL to learn, a config format to debug, an abstraction that will eventually leak. Every tool is a point of fragmentation... another place where the real state of your system diverges from what you think it is.

The discipline of IaC isn't about tooling. It's about reducing the gap between what you think your system is and what it actually is. It's about making your data platform legible... to your team, to your future self, to an engineer who's never seen it before.

You close that gap by writing code. Real code. Python that you can test. SQL that you can diff. Terraform that you can plan. Dockerfiles that you can build. Manifests that you can review.

Not tools. Code.

If it's not in Git, it's not infrastructure. It's just hope.