
Eamon Boyle Writing

Matching systems · 7 min read

Building matchers that explain themselves

A practical note on weighted scoring, audit trails, and reviewer feedback loops for metadata matching systems that people can actually trust.

C# · .NET · SQL Server · Audit trails · AI-adjacent

There is a version of a matching system that looks clever in a demo and immediately becomes a support burden in production.

You feed in a pile of transmissions, productions, episodes, titles, alt titles, durations, and timing windows. It emits answers. Some of them are correct. Some are not. The bad ones are not obviously bad until somebody downstream notices a report is wrong, or a cue sheet looks odd, or a QA pass turns into a detective story.

That version of the system is not useful enough.

What I wanted instead was something more operational: a matcher that could say why it chose a candidate, what evidence it trusted, where the ambiguity was, and how a reviewer could push the system toward better later decisions.

The core idea

At the centre was a weighted scoring engine.

Not a mystical black box. Not a single confidence number conjured from nowhere. A sum of named signals:

  • title similarity
  • episode alignment
  • runtime distance
  • channel and scheduling context
  • source quality
  • conflict penalties

Each signal contributed to a score, and the score was stored with a breakdown. If a candidate won because runtime was near-perfect but title similarity was only mediocre, you could see that. If two candidates were nearly tied and one lost because an episode rule penalised it, you could see that too.
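To make the idea concrete, here is a minimal sketch of that kind of scorer. The original system was C#/.NET; this is Python for brevity, and the signal names, weights, and penalty shapes are illustrative assumptions, not the production values.

```python
from dataclasses import dataclass

# Hypothetical weights -- the real system's tuning is not shown in this post.
WEIGHTS = {
    "title_similarity": 0.35,
    "episode_alignment": 0.25,
    "runtime_distance": 0.20,
    "schedule_context": 0.10,
    "source_quality": 0.10,
}

@dataclass
class ScoredCandidate:
    candidate_id: str
    score: float
    breakdown: dict   # signal name -> weighted contribution
    penalties: list   # names of penalty rules that fired

def score_candidate(candidate_id, signals, penalties):
    """Sum named signals into one score, keeping the per-signal breakdown."""
    breakdown = {name: WEIGHTS[name] * value for name, value in signals.items()}
    score = sum(breakdown.values()) - sum(amount for _, amount in penalties)
    return ScoredCandidate(
        candidate_id, score, breakdown, [name for name, _ in penalties]
    )
```

The point is the return shape: the breakdown and the penalty names travel with the score, so "why did this candidate win?" is answerable from the stored output alone.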

This sounds obvious written down. It is still worth insisting on.

Systems become supportable when they leave a trail.

What the audit trail needed to answer

I kept coming back to a few practical questions:

  1. Why did we match this record at all?
  2. Why this candidate and not the next closest one?
  3. Which inputs were missing or weak?
  4. What would a reviewer need to correct in order to improve future runs?

That drove the shape of the data we persisted. The scoring output was not just winner_id and score. It included structured reasons, penalty markers, and enough context that QA could inspect a decision without rebuilding the whole run in their head.
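A persisted decision along those lines might look like the record below. Every field name here is an assumption for illustration, not the actual schema, but the shape is the point: reasons, penalties, and the runner-up all travel with the winner.

```python
import json

# Illustrative record only -- field names are assumptions, not the real schema.
decision = {
    "winner_id": "cand-42",
    "score": 0.91,
    "reasons": [
        {"signal": "runtime_distance", "value": 0.98, "weight": 0.20},
        {"signal": "title_similarity", "value": 0.61, "weight": 0.35},
    ],
    "penalties": [
        {"rule": "episode_mismatch", "amount": 0.10, "applied_to": "cand-17"},
    ],
    "runner_up": {"candidate_id": "cand-17", "score": 0.88},
    "inputs_missing": ["alt_titles"],
}

# Serialising the breakdown keeps it queryable alongside the match row,
# e.g. in a JSON column next to the scalar winner_id and score.
payload = json.dumps(decision)
```

With this in place, "why this candidate and not the next closest one?" is a lookup, not an archaeology dig.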

That gave us a few immediate benefits:

  • debugging was faster because bad matches had a visible failure surface
  • product conversations became more concrete because we could talk about signals, not vibes
  • reviewer decisions could become training data for future refinement instead of disappearing into email threads

The trick is not just "be accurate"

Accuracy matters, but production matching work is usually multi-objective.

You want:

  • acceptable precision and recall
  • deterministic reruns
  • explainability for edge cases
  • a review loop that fits real operational time

If you optimise only for aggregate accuracy, you can still produce a system that people distrust.

In practice, some of the best improvements came from making the decision process legible:

  • confidence bands instead of pretending every result was equally certain
  • explicit conflict detection
  • clear "no match" outcomes rather than forcing a weak winner
  • reviewer tooling that made suspicious cases easy to inspect
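The first three of those can be sketched as one small routing function. The thresholds below are invented for illustration; in practice they would be tuned against the dataset.

```python
# Illustrative thresholds -- real band boundaries would be tuned, not guessed.
AUTO_ACCEPT = 0.85   # above this, accept without review
REVIEW_FLOOR = 0.60  # below this, a weak winner is worse than no match
TIE_MARGIN = 0.05    # top two this close => conflict, send to review

def classify(best_score, runner_up_score):
    """Turn raw scores into an operational outcome instead of forcing a winner."""
    if best_score < REVIEW_FLOOR:
        return "no_match"
    if best_score - runner_up_score < TIE_MARGIN:
        return "review:conflict"
    if best_score >= AUTO_ACCEPT:
        return "auto_accept"
    return "review:low_confidence"
```

Explicit "no_match" and "review:conflict" outcomes are what keep the matcher honest: ambiguity becomes a routed work item rather than a silently wrong answer.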

That last point matters more than it sounds. A matcher is not just a model or a rules engine. It is also the workflow around disputed decisions.

Human review should not be an apology layer

A lot of software treats review as a euphemism for "somebody will clean this up later."

I wanted review to act as a force multiplier instead.

That meant a reviewer seeing:

  • the top candidates
  • the score breakdown
  • the decisive penalties
  • enough metadata context to make a fast decision

And when a reviewer marked something as correct, incorrect, or no match, that outcome became a durable signal. Not just a one-off fix.
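One way to make that durable is an append-only log of reviewer verdicts, keyed to the decision they judged. Again this is a Python sketch with invented names, standing in for what the .NET system persisted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ReviewOutcome:
    decision_id: str
    verdict: str                        # "correct" | "incorrect" | "no_match"
    corrected_winner: Optional[str] = None  # set when the reviewer picks another candidate
    reviewed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def apply_review(history, outcome):
    """Append-only: later runs can mine these verdicts as labelled examples."""
    history.append(outcome)
    return history
```

Because the log is append-only and tied to decision ids, a reviewer's correction is recoverable months later as a labelled example, not lost in an email thread.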

This is one of the places where AI-adjacent work and normal systems engineering overlap in a healthy way. You do not need to jump straight to a model to benefit from feedback loops. You can start by making your existing deterministic system measurable and inspectable.

A note on "intelligence"

People often ask whether a matcher like this should eventually become an ML problem.

Maybe. But I think there is a more useful ordering:

  1. make the deterministic system explicit
  2. persist the evidence
  3. build the review loop
  4. learn where the ambiguity actually lives

Only then does it make sense to ask where probabilistic methods would add value.

If you skip those steps, you end up with a system that is harder to reason about and still weak operationally.

If you do them first, you create a clean runway for future ML work, because you already have labelled outcomes, score histories, and a decent picture of the problem surface.

What I like about this kind of system

It sits in an interesting zone between product, infrastructure, and decision science.

The code matters. The data model matters. The reviewer interface matters. The logging matters. Each part changes how much trust the organisation can place in the output.

And trust, in this kind of internal software, is usually the whole game.

If a system can explain itself, people use it sooner, challenge it better, and improve it faster.