
Eamon Boyle Writing

Data systems · 7 min read

Deterministic ingestion pipelines are a feature, not just plumbing

What changed when I stopped treating imports as background glue and started designing them like product features with clear failure surfaces.

C# · .NET · SQL Server · ETL · Operations

Most internal systems accumulate imports the same way garages accumulate mystery cables: one script for this supplier, a half-manual runbook for that spreadsheet, and an API job everyone is afraid to touch because nobody is sure which table it writes to first.

That arrangement can limp along for a while. Then the cost arrives all at once:

  • failures show up late, after bad rows have already landed
  • retries create duplicates because the job was never designed to be replayed
  • support cannot tell whether the problem is the file, the source system, or the rules
  • every new feed becomes a mini greenfield project

I spent a chunk of time working on data ingestion in exactly that sort of environment. The useful change was not "make imports faster" in the abstract. It was to make them deterministic enough that people could trust what would happen before they pressed run.

The pipeline shape mattered more than the transport

CSV, Excel, and API payloads all look different on the outside, so teams often build different code paths for each. That feels pragmatic until the third or fourth source lands and you are maintaining a maze.

The better move was to standardize the lifecycle instead:

  1. intake the source
  2. normalize it into a common intermediate shape
  3. validate it with explicit business rules
  4. transform it into persistence-ready records
  5. write it in one controlled step

Once that shape exists, the source format becomes a detail rather than the architecture.

That was the biggest simplifier. Instead of discussing whether a feed was "an API import" or "a spreadsheet import", the team could talk about where in the lifecycle something failed.
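The five-step lifecycle can be sketched as one pipeline function whose stages are just pluggable callables. This is an illustrative Python sketch, not the C#/.NET implementation the post's stack implies; every name here is invented:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Row:
    source_line: int
    values: dict

def run_pipeline(
    intake: Callable[[], Iterable[dict]],     # 1. intake the source
    normalize: Callable[[dict], Row],         # 2. common intermediate shape
    validate: Callable[[Row], list[str]],     # 3. explicit business rules
    transform: Callable[[Row], dict],         # 4. persistence-ready records
    write: Callable[[list[dict]], None],      # 5. one controlled write step
) -> list[str]:
    rows = [normalize(raw) for raw in intake()]
    errors = [err for row in rows for err in validate(row)]
    if errors:
        return errors  # nothing is written if any rule failed
    write([transform(row) for row in rows])
    return []
```

Because the source format only shows up inside `intake` and `normalize`, adding a CSV, Excel, or API feed means writing those two stages and reusing the rest.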

Reject early, write late

The mistake I distrust most in ingestion work is partial persistence. If invalid rows can sit beside valid rows in the same run, debugging becomes archaeology.

I preferred a pipeline that validated first and persisted after the run had proven it was coherent. That design has a few nice properties:

  • rollback logic stays simple because there is less to roll back
  • operators do not have to ask which subset of rows made it through
  • support can rerun the job without treating the database like a crime scene

There are cases where partial acceptance is the right product decision. But if the domain cares about auditability and predictable retries, "all clear or clearly failed" is usually the saner default.
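A minimal sketch of "all clear or clearly failed", using sqlite3 for brevity rather than the SQL Server stack the post mentions; the table, field, and rule are invented:

```python
import sqlite3

def import_rows(conn: sqlite3.Connection, rows: list[dict]) -> list[str]:
    # Validate everything first; collect every problem, not just the first.
    errors = [
        f"row {i}: empty title"
        for i, row in enumerate(rows, start=1)
        if not row.get("title", "").strip()
    ]
    if errors:
        return errors  # clearly failed: the table was never touched
    with conn:  # one transaction: every row lands, or none do
        conn.executemany("INSERT INTO episodes (title) VALUES (:title)", rows)
    return []  # all clear
```

Rollback stays trivial because there is nothing mid-flight to roll back: the write either happened as one unit or never started.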

The failure surface is part of the product

A lot of ingestion systems technically log errors while still being unhelpful in practice.

"Import failed" is not an operational tool.

What the operator actually needs is something closer to:

Supplier feed X, file 2026-02-11.csv, row 1842, field episode_title: expected a known canonical value, received an empty string after trimming.

That kind of detail changes the conversation immediately. Instead of filing vague bugs or manually spelunking the database, the owner can locate the bad record and decide whether the source should be fixed, the rule should be relaxed, or the mapping should be updated.

In other words: the system should tell you what failed, where it failed, and under which rule it failed.
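One way to make that concrete is a structured failure record that carries the feed, file, row, field, and rule, and renders the operator-facing message from those parts. A sketch with invented names, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RowFailure:
    feed: str    # which source
    file: str    # which delivery
    row: int     # where it failed
    field: str   # what failed
    rule: str    # under which rule it failed
    detail: str  # what was actually received

    def message(self) -> str:
        return (
            f"{self.feed}, file {self.file}, row {self.row}, "
            f"field {self.field}: {self.rule}; {self.detail}"
        )
```

Keeping the record structured, rather than only formatting a string, also lets support filter failures by feed or rule instead of grepping logs.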

Configuration is not a dirty word

I like code, so I understand the temptation to encode every feed directly in the language. But once the pattern is established, per-source config can be a feature rather than a compromise.

For recurring feeds, configuration helped keep ownership obvious:

  Concern                          Good default
  field mapping                    source-specific config
  validation rule implementation   shared code
  per-source toggles               config
  persistence semantics            shared code

That split prevented the system from turning into dozens of forked importers. It also made it easier to explain, to people who had never touched the code, how a new feed would be introduced.
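In practice the split can be as plain as this: per-source concerns live in data, rules live in shared code. The feed name, field names, and toggle below are invented for illustration:

```python
# Per-source concerns: declarative, reviewable, owned per feed.
FEED_CONFIG = {
    "supplier_x": {
        "field_map": {"Ep Title": "episode_title", "Air Date": "aired_on"},
        "allow_missing_dates": False,  # per-source toggle
    },
}

def apply_mapping(raw: dict, feed: str) -> dict:
    # Rename source columns to canonical names; pass unknowns through.
    field_map = FEED_CONFIG[feed]["field_map"]
    return {field_map.get(key, key): value for key, value in raw.items()}

# Shared concern: one rule implementation, identical for every feed.
def require_nonempty(record: dict, field: str) -> list[str]:
    return [] if str(record.get(field, "")).strip() else [f"{field} is empty"]
```

Adding a feed means adding a config entry, not forking a validator.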

Idempotency is not optional

Any job that can fail will eventually be retried. If retries are part of reality, idempotency is part of the design.

For me that usually means being explicit about a run identity and defining what counts as "the same work" when a replay occurs. The answer varies by domain, but the requirement does not.

If the system cannot safely answer "what happens if I run this again?", the operator is being asked to gamble.
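One sketch of that idea: derive a deterministic run identity from the feed and file, and let a natural key define "the same work" so a replay updates rather than duplicates. This uses SQLite's upsert syntax for brevity; the schema and key choice are invented, and the right natural key is a domain decision:

```python
import hashlib
import sqlite3

def run_id(feed: str, file_name: str) -> str:
    # Same feed + same file => same run identity, every time.
    return hashlib.sha256(f"{feed}:{file_name}".encode()).hexdigest()

def write_idempotent(
    conn: sqlite3.Connection, rows: list[dict], feed: str, file_name: str
) -> None:
    rid = run_id(feed, file_name)
    with conn:
        conn.executemany(
            "INSERT INTO episodes (run_id, natural_key, title) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(natural_key) DO UPDATE SET title = excluded.title",
            [(rid, row["key"], row["title"]) for row in rows],
        )
```

Running the same file twice leaves the table exactly as it was, which is the honest answer to "what happens if I run this again?".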

Boring systems win

One reason I like this class of problem is that it rewards a slightly unfashionable kind of engineering taste. The win is rarely something flashy. The win is that operations stop dreading a routine import. The win is that audit questions can be answered from the system instead of reconstructed from memory and screenshots.

That kind of boring is excellent.

If I were starting the next ingestion-heavy system tomorrow, I would bias toward the same principles again:

  • one lifecycle across source types
  • validation before persistence
  • explicit, human-usable failure messages
  • replay semantics designed up front
  • configuration where it preserves clarity rather than hiding logic

That is not glamorous architecture. It is just a way to make the software behave like it respects the people who have to live with it.