Detection Over Correction

In the last article I built an engine that turns JSON-LD, microdata, and the mess in between into one canonical Recipe. But “extracted successfully” and “good data” are two very different claims. A site can hand you perfectly valid structured data that is also garbage — every field present, schema correct, and the entire ingredient list crammed into a single string. The engine will happily extract it. So what should it do about it?

The tempting answer is: fix it. Detect the blob, split it, hand back a clean list. That instinct is exactly what made the previous version of this extractor unmaintainable, and unwinding it is the most important design decision in the whole project.

Correction is a bottomless pit

Here’s the problem with “just fix the bad data.” Correction is unbounded. There is no finite set of ways a recipe site can be wrong. One site jams everything into one recipeIngredient string. Another splits a single logical step across 23 sentence fragments. A German site writes Gläser for a unit; a French one writes amount-less lines; an Italian one puts the quantity last (flour 4 cups). Each fix you add is a guess about intent, and every guess has a false-positive rate.

And the false positives are the killer, because a wrong correction is worse than no correction at all. If I “helpfully” split 1/2 cup plus 2 tablespoons into two ingredients, I haven’t cleaned the data — I’ve manufactured a recipe that doesn’t exist. For a project whose entire value is a trustworthy dataset, silently poisoning it is the one unforgivable failure. A missing field is honest. A confidently wrong field is a lie with good formatting.

So the engine inverts the priority:

Code’s primary job on bad data is to measure it, not fix it. Detection is bounded and stable. Correction is unbounded and dangerous.

Detectors are cheap and safe — a detector that misfires just produces a slightly-off score, never a corrupted recipe. You can have as many as you want. Corrections are expensive and risky, so they’re gated behind a much higher bar (more on that below).

The quality scorer

The instrument is scoreRecipe — a pure function that runs a battery of deterministic detectors over an assembled Recipe and returns a score plus the list of every signal that fired.

export interface RecipeQuality {
    /** overall trustworthiness, 0..1 (higher = better) */
    score: number;
    /** per-field 0..1 scores (name, ingredients, steps, completeness) */
    fields: Record<string, number>;
    /** every detector that fired, for explainability + measurement */
    signals: QualitySignal[];
}

The detectors themselves are intentionally dumb. Is the ingredient list empty? Is any single ingredient suspiciously long (a likely unsplit blob)? Are there fewer than two ingredients? Is a step over 800 characters? Does the name look like it carries a Recipe Name | Site Name suffix?

const ingredients = recipe.ingredientGroups.value.flatMap((g) => g.ingredients);
if (ingredients.some((i) => i.raw.length > INGREDIENT_BLOB_CHARS)) {
    ingredientScore -= PENALTY.ingredientBlob;
    signals.push({
        code: "ingredients.blob",
        severity: "warn",
        field: "ingredients",
        message: "An ingredient is suspiciously long (likely an unsplit blob).",
    });
}

Notice the detector doesn’t do anything to the blob. It docks the score and emits a signal that says, in plain English, what it suspects. The final score is a weighted sum — ingredients and steps dominate, name and completeness are minor:

const score = clamp01(
    fields.name * WEIGHTS.name +            // 0.15
        fields.ingredients * WEIGHTS.ingredients + // 0.40
        fields.steps * WEIGHTS.steps +      // 0.40
        fields.completeness * WEIGHTS.completeness, // 0.05
);

The exact weights don’t matter yet — and that’s a feature, not an admission. Their only job right now is to make “bad” measurable and comparable across sites. The real numbers get tuned later from data, not guessed at a whiteboard.
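To make the arithmetic concrete, here is a minimal, self-contained sketch of the weighted sum. The weight values are taken from the comments in the snippet above; the `overallScore` helper, the `FieldScores` type, and the sample field scores are assumptions for illustration, not the engine's actual code.

```typescript
// Weights copied from the comments in the article's snippet; they sum to 1.
const WEIGHTS = { name: 0.15, ingredients: 0.4, steps: 0.4, completeness: 0.05 } as const;

type FieldName = keyof typeof WEIGHTS;
type FieldScores = Record<FieldName, number>;

/** Clamp a number into the closed interval [0, 1]. */
function clamp01(x: number): number {
    return Math.min(1, Math.max(0, x));
}

/** Hypothetical helper: weighted sum of per-field scores, clamped to 0..1. */
function overallScore(fields: FieldScores): number {
    let total = 0;
    for (const key of Object.keys(WEIGHTS) as FieldName[]) {
        total += fields[key] * WEIGHTS[key];
    }
    return clamp01(total);
}

// Made-up field scores: fine everywhere except a heavily penalised ingredient list.
const blobby: FieldScores = { name: 1, ingredients: 0.5, steps: 1, completeness: 1 };
const clean: FieldScores = { name: 1, ingredients: 1, steps: 1, completeness: 1 };

console.log(overallScore(blobby).toFixed(2), overallScore(clean).toFixed(2));
```

With these weights, halving the ingredient score alone pulls the total down by 0.20 (0.5 × 0.40), while halving the name score would cost only 0.075. That is the sense in which ingredients and steps dominate.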

Threshold-free by design

This is the part I’m proudest of, because it’s a discipline that’s easy to violate without noticing. The scorer measures. It never decides. There is no if (score < 0.7) repair() anywhere inside it. It hands back a number and a list of signals, and that’s the end of its job.

The one place a score turns into a decision lives in its own tiny module, quarantined on purpose:

export const LOW_QUALITY_THRESHOLD = 0.7;

export function needsRepair(quality: RecipeQuality): boolean {
    return (
        quality.score < LOW_QUALITY_THRESHOLD ||
        quality.signals.some((s) => s.severity === "error")
    );
}

Why bother separating measurement from decision so aggressively? Because the threshold is the part I know I’ll get wrong at first. If thresholds were baked into the scorer, re-tuning would mean surgery on the measurement logic, and every consumer would silently inherit a new policy. By keeping the scorer threshold-free, I can gather scores across hundreds of sites with the measurement frozen, then draw the line where the data says it belongs. The score is the thermometer; the threshold is the thermostat; you don’t want them welded together.
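One way to picture "draw the line where the data says it belongs" is a threshold sweep over scores that have already been computed. The sketch below is an assumption-laden illustration: `needsRepairAt` and `sweepThresholds` are hypothetical names, the `QualitySignal` shape is inferred from the detector snippet earlier, and the sample scores are invented rather than taken from the real corpus.

```typescript
// Local copies of the article's types so the sketch stands alone.
// The QualitySignal shape is inferred from the detector example, not confirmed.
interface QualitySignal {
    code: string;
    severity: "warn" | "error";
    field: string;
    message: string;
}

interface RecipeQuality {
    score: number;
    fields: Record<string, number>;
    signals: QualitySignal[];
}

/** Same policy as needsRepair, but with the cutoff passed in so it can be swept. */
function needsRepairAt(quality: RecipeQuality, threshold: number): boolean {
    return quality.score < threshold || quality.signals.some((s) => s.severity === "error");
}

/** For each candidate threshold, count how many already-scored recipes would be flagged. */
function sweepThresholds(scored: RecipeQuality[], candidates: number[]): Map<number, number> {
    const flagged = new Map<number, number>();
    for (const t of candidates) {
        flagged.set(t, scored.filter((q) => needsRepairAt(q, t)).length);
    }
    return flagged;
}

// Invented scores purely for illustration.
const sample: RecipeQuality[] = [
    { score: 0.98, fields: {}, signals: [] },
    { score: 0.82, fields: {}, signals: [] },
    { score: 0.64, fields: {}, signals: [] },
];

const counts = sweepThresholds(sample, [0.6, 0.7, 0.9]);
```

The scores are computed once and never touched again; only the candidate cutoffs change. That is what keeping the measurement frozen buys.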

“Never lie” reaches into the parser too

This philosophy isn’t confined to the scorer — it’s wired into the ingredient parser as a hard rule. The parser only confidently structures the regular leading-quantity shape (2 cups flour). When it sees something it can’t safely model, it doesn’t guess — it keeps the line whole and reports low confidence.

The cleanest example is the quantity-last line. Type 00 flour 4 cups is a real pattern on Italian sites. A greedy parser might yank the 4 cups and produce { amount: 4, unit: "cup", item: "Type 00 flour" }. That looks clean, but it is a fabricated reading of a line the parser didn’t actually understand. So the parser refuses:

it("does NOT misparse quantity-last lines — keeps them whole at low confidence", () => {
    const r = parseIngredient("Type 00 flour 4 cups");
    expect(r.amount).toBeUndefined();
    expect(r.unit).toBeUndefined();
    expect(r.item).toBe("Type 00 flour 4 cups");
    expect(r.parseConfidence).toBeLessThanOrEqual(0.5);
});
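A toy version of that refusal might look like the following. This is a sketch under stated assumptions, not the real parser: `parseLeadingQuantity`, the tiny unit table, and the two confidence constants are all hypothetical, and the result shape is inferred from the tests in this article.

```typescript
// Result shape inferred from the article's tests; field names are assumptions.
interface ParsedIngredient {
    raw: string;
    amount?: { min: number; max: number };
    unit?: string;
    item: string;
    parseConfidence: number;
}

// Hypothetical unit table and confidence values, for illustration only.
const KNOWN_UNITS = new Map<string, string>([
    ["cup", "cup"],
    ["cups", "cup"],
    ["tbsp", "tbsp"],
    ["g", "g"],
]);
const HIGH_CONFIDENCE = 0.9;
const LOW_CONFIDENCE = 0.3;

/** Keep the whole line as the item and admit the parser did not understand it. */
function keepWhole(raw: string): ParsedIngredient {
    return { raw, item: raw.trim(), parseConfidence: LOW_CONFIDENCE };
}

/** Structure only "<number> <known unit> <item>"; anything else stays whole. */
function parseLeadingQuantity(raw: string): ParsedIngredient {
    const m = /^(\d+(?:\.\d+)?)\s+(\S+)\s+(.+)$/.exec(raw.trim());
    if (m === null) return keepWhole(raw);

    const [, qty = "", unitToken = "", rest = ""] = m;
    const unit = KNOWN_UNITS.get(unitToken.toLowerCase());
    if (unit === undefined) return keepWhole(raw);

    const n = Number(qty);
    return { raw, amount: { min: n, max: n }, unit, item: rest, parseConfidence: HIGH_CONFIDENCE };
}

const regular = parseLeadingQuantity("2 cups flour");
const quantityLast = parseLeadingQuantity("Type 00 flour 4 cups");
```

The sketch is deliberately conservative: a line like 2 large eggs also stays whole, because large is not a known unit. Under this rule, failing to structure a line is acceptable. Structuring it wrongly is not.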

And raw is always preserved — the parser is allowed to fail to understand a line, but never allowed to lose it. The downstream UI uses parseConfidence to decide: high confidence renders the structured 2 cups · flour; low confidence falls back to showing the raw text exactly as the site wrote it. The reader degrades gracefully instead of lying confidently.
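The rendering fallback can be sketched in a few lines. The cutoff value and the `renderIngredient` name are assumptions, since the actual number the UI uses isn't stated here, and the `ParsedIngredient` shape is inferred from the tests above.

```typescript
// Shape inferred from the article's tests; not the engine's actual type.
interface ParsedIngredient {
    raw: string;
    amount?: { min: number; max: number };
    unit?: string;
    item: string;
    parseConfidence: number;
}

// Hypothetical cutoff for trusting the structured fields.
const STRUCTURED_RENDER_MIN_CONFIDENCE = 0.8;

/** Render a single amount, or a range when min and max differ. */
function formatAmount(a: { min: number; max: number }): string {
    return a.min === a.max ? String(a.min) : `${a.min}-${a.max}`;
}

/** Structured rendering when confident; otherwise the site's own text, untouched. */
function renderIngredient(p: ParsedIngredient): string {
    if (p.parseConfidence < STRUCTURED_RENDER_MIN_CONFIDENCE || p.amount === undefined) {
        return p.raw;
    }
    // Pluralising units (cup -> cups) is left out of this sketch.
    const quantity = p.unit ? `${formatAmount(p.amount)} ${p.unit}` : formatAmount(p.amount);
    return `${quantity} · ${p.item}`;
}

const confident: ParsedIngredient = {
    raw: "200 g flour",
    amount: { min: 200, max: 200 },
    unit: "g",
    item: "flour",
    parseConfidence: 0.95,
};
const unsure: ParsedIngredient = {
    raw: "Type 00 flour 4 cups",
    item: "Type 00 flour 4 cups",
    parseConfidence: 0.3,
};

console.log(renderIngredient(confident)); // structured form
console.log(renderIngredient(unsure)); // raw text as written by the site
```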

When is a correction allowed? The rule of 3

So corrections are banned forever? No — they’re gated. A correction earns its way into a normalizer only when it clears two bars at once:

The pattern shows up in test data from at least three distinct sites, and
The fix is high-precision — safe, with a reliable way to detect when it doesn’t apply.

One-off site weirdness never gets special-cased in code; it just scores low and waits. And here’s the elegant part: the quality signals are how I count occurrences across sites. The rule of 3 isn’t eyeballed — it’s measured by the very instrument that flags the problem. “Compound quantities” (1½ cups plus 1 Tbsp) currently sits at exactly three sites, right at the threshold, which is why the parser captures the primary amount and parks the rest in notes rather than attempting a full parse — the floor is guarded, the heroics are deferred.

it("parses the primary quantity of a compound line and preserves raw", () => {
    const r = parseIngredient("1½ cups plus 1 Tbsp. (200 g) all-purpose flour");
    expect(r.amount).toEqual({ min: 1.5, max: 1.5 });
    expect(r.unit).toBe("cup");
    expect(r.notes).toContain("200 g");
    expect(r.raw).toBe("1½ cups plus 1 Tbsp. (200 g) all-purpose flour");
});
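Counting the first bar of the rule of 3 from quality signals could look roughly like this. It is a sketch with invented inputs: `SiteResult`, `distinctSitesPerSignal`, `meetsRuleOfThree`, the example hostnames, and the `ingredients.compoundQuantity` code are all hypothetical. Only `ingredients.blob` appears earlier in this article.

```typescript
// Hypothetical record of one scored page: which site it came from and which signals fired.
interface SiteResult {
    site: string;
    signalCodes: string[];
}

const RULE_OF_THREE = 3;

/** Map each signal code to the set of distinct sites on which it fired. */
function distinctSitesPerSignal(results: SiteResult[]): Map<string, Set<string>> {
    const sites = new Map<string, Set<string>>();
    for (const r of results) {
        for (const code of r.signalCodes) {
            let seen = sites.get(code);
            if (seen === undefined) {
                seen = new Set<string>();
                sites.set(code, seen);
            }
            seen.add(r.site); // several pages from one site still count once
        }
    }
    return sites;
}

/** First bar only: has the pattern been seen on at least three distinct sites? */
function meetsRuleOfThree(results: SiteResult[], code: string): boolean {
    return (distinctSitesPerSignal(results).get(code)?.size ?? 0) >= RULE_OF_THREE;
}

// Invented corpus for illustration.
const corpus: SiteResult[] = [
    { site: "a.example", signalCodes: ["ingredients.compoundQuantity"] },
    { site: "a.example", signalCodes: ["ingredients.compoundQuantity"] },
    { site: "b.example", signalCodes: ["ingredients.compoundQuantity", "ingredients.blob"] },
    { site: "c.example", signalCodes: ["ingredients.compoundQuantity"] },
];
```

This only answers the counting question. The second bar, whether a fix is high-precision and can detect when it doesn't apply, remains a judgement the counter can't make.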

What the data actually said

Once the scorer existed, I pointed it at a corpus of 18 real sites plus a synthetic clean recipe. The result reframed the entire repair problem:

17 of 18 scored 0.95–1.00. Exactly one was broken — barefootcontessa.com, at 0.639, because it has no author and dumps its whole ingredient list into one blob. It’s become the canonical broken case I test against.

That ~94%-clean number (17 of 18) is genuinely useful, and a little deflating. Modern structured data is mostly good. Repair isn’t the common case I’d imagined; it’s a long tail. Which means the highest-value work isn’t writing more correction heuristics; it’s collecting more known-bad fixtures so the scorer has something to discriminate against. A scorer validated only on clean data isn’t really validated at all.

It also means the eventual repair strategy writes itself: the score is field-level, so a future LLM-repair pass can target only the bad fields on the rare bad site — fix barefootcontessa’s ingredient blob, leave its (perfectly good) steps alone — and because every URL is repaired once and cached, the expensive path runs at most once per broken page.
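Since that repair pass doesn't exist yet, the following is only a speculative sketch of the targeting and caching idea. The per-field cutoff, `fieldsToRepair`, `repairOnce`, and the in-memory `Map` cache are assumptions, and the `repair` callback stands in for whatever the real repair step turns out to be.

```typescript
// Hypothetical per-field cutoff; the article only defines an overall threshold.
const FIELD_REPAIR_THRESHOLD = 0.7;

/** Pick only the fields whose own score is low; good fields are left alone. */
function fieldsToRepair(fields: Record<string, number>): string[] {
    return Object.entries(fields)
        .filter(([, score]) => score < FIELD_REPAIR_THRESHOLD)
        .map(([name]) => name);
}

// In-memory stand-in for a persistent cache keyed by URL.
const repairCache = new Map<string, string[]>();

/** Run the (expensive) repair callback at most once per URL, on bad fields only. */
function repairOnce(
    url: string,
    fields: Record<string, number>,
    repair: (field: string) => void,
): string[] {
    const cached = repairCache.get(url);
    if (cached !== undefined) return cached;

    const targets = fieldsToRepair(fields);
    for (const field of targets) repair(field);
    repairCache.set(url, targets);
    return targets;
}

// Invented field scores: only the ingredient list is bad.
const pageFields = { name: 1, ingredients: 0.3, steps: 1, completeness: 0.8 };
const repaired: string[] = [];
const first = repairOnce("https://example.com/recipe", pageFields, (f) => repaired.push(f));
const second = repairOnce("https://example.com/recipe", pageFields, (f) => repaired.push(f));
```

After both calls, the callback has run once, for the ingredients field only; the second call is served from the cache.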

The scorer didn’t fix a single recipe. It told me which ones need fixing, how badly, and where — and turned “make the data good” from an infinite coding project into a bounded measurement problem. That’s the whole trade.

Next up: the reader has to actually render over some of the most hostile pages on the web. That’s a different kind of fight.