
The petabyte pipeline that stays debuggable

A pipeline that processes a petabyte a day is not impressive if no one can explain why a number is wrong. The hard problem at scale isn’t throughput — it’s keeping the system legible to the people who operate it.

Every stage is addressable

Each transform writes a typed, versioned dataset to a known location. Any intermediate result can be queried directly, which means debugging is “look at stage 7’s output,” not “re-run the whole DAG and add print statements.”

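// Join events to profiles, then write the result to a versioned, known location.
val enriched = events
  .join(profiles, "user_id")
  .checkpoint("s3://lake/enriched/v3") // "v3" is the dataset version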
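
The `.checkpoint(path)` call above looks like a helper specific to this pipeline (Spark's built-in `Dataset.checkpoint` takes no path argument). To show the idea without any cluster, here is a minimal sketch in plain Scala, assuming a local directory stands in for the lake. Every name in it (`Event`, `Profile`, `Enriched`, `enrich`, `writeStage`, `readStage`, the `part-0.csv` layout) is hypothetical and chosen for illustration, not taken from the real system.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object AddressableStages {
  // Hypothetical row types for this sketch; the real schemas are not shown in the post.
  final case class Event(userId: String, action: String)
  final case class Profile(userId: String, country: String)
  final case class Enriched(userId: String, action: String, country: String)

  // The transform itself: an inner join of events to profiles on userId.
  def enrich(events: Seq[Event], profiles: Seq[Profile]): Seq[Enriched] = {
    val byUser = profiles.map(p => p.userId -> p).toMap
    events.flatMap { e =>
      byUser.get(e.userId).map(p => Enriched(e.userId, e.action, p.country))
    }
  }

  // Persist a stage's output at <root>/<name>/<version>/part-0.csv.
  // The location depends only on (name, version), so anyone can find it later.
  // Fields are assumed to contain no commas; a real format would be typed (e.g. Parquet).
  def writeStage(root: Path, name: String, version: String, rows: Seq[Enriched]): Path = {
    val dir = Files.createDirectories(root.resolve(name).resolve(version))
    val lines = rows.map(r => s"${r.userId},${r.action},${r.country}")
    Files.write(dir.resolve("part-0.csv"), lines.asJava)
  }

  // Read a stage's output back without re-running anything upstream.
  def readStage(root: Path, name: String, version: String): Seq[Enriched] = {
    val file = root.resolve(name).resolve(version).resolve("part-0.csv")
    Files.readAllLines(file).asScala.toSeq.map { line =>
      line.split(",", -1) match {
        case Array(u, a, c) => Enriched(u, a, c)
        case _ => throw new IllegalArgumentException(s"malformed row: $line")
      }
    }
  }
}
```

After a run, `readStage(root, "enriched", "v3")` returns exactly what the stage wrote, so a wrong number downstream can be traced by inspecting that one output. Bumping the version string when the transform changes keeps the old output around for comparison.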

The checkpoint isn’t just a performance trick — it’s a debugging surface. Legibility is a feature you build in, not one you recover later.
