Extracting Clean Recipes from Messy Pages

If you cook, you already know the problem. You search for a recipe, land on a page, and before you can read a single ingredient you’re wading through a life story, a wall of ads, an autoplaying video, three newsletter popups, and a “jump to recipe” button that scrolls you to roughly the right neighborhood. The recipe — the actual reason you’re there — is maybe 1% of the page.

Readable Recipes is a browser extension that fixes that. It pulls the recipe off the page and renders it in a clean reader view: title, ingredients, steps, times, nothing else. This article is about the part that has to happen first, and the part that’s quietly the hardest: getting a trustworthy recipe out of the page in the first place.

You can follow along with the engine code in the extractor package.

The data is there. That’s the trap.

Most recipe sites embed their recipe as structured data, machine-readable markup meant for Google’s rich results. There are three flavors in the wild: JSON-LD (a JSON blob inside a <script> tag), microdata (itemprop attributes sprinkled through the HTML), and RDFa (a rarer cousin). So in theory you don’t have to scrape anything. You read the structured data and you’re done.

In practice, every site does it slightly differently, half of them do it wrong, and “has structured data” doesn’t even mean “has the recipe in the structured data.” One site I tested bolts on a JSON-LD block that contains only its breadcrumb navigation — the recipe itself is still microdata-only. Another crams its entire ingredient list into a single string. The data is there, which lulls you into thinking the job is easy, right up until you’ve written your fourth special case.

The first version of this extension fell into exactly that trap: a parser per format, each one growing site-specific patches, two of them duplicated and only the weaker one actually wired up. The rebuild started from a different premise.

One pipeline, many sources

The bet is this: the format a site uses to express a recipe should be the only thing that varies. Everything downstream — parsing an ingredient line, normalizing a duration, sorting images, scoring quality — should be written exactly once and shared.

So the engine is a pipeline with a single seam at the front:

SourceAdapter[]  →  RawRecipe[]  →  assemble  →  merge  →  Recipe
   (per format)     (common shape)   (normalize)  (combine)  (canonical)

A SourceAdapter is the entire extension point. It does two things: cheaply detect whether its format is on the page, and extract zero or more raw candidates. It is contractually forbidden from throwing — when something is wrong it records a diagnostic and keeps going.

export interface SourceAdapter {
    readonly id: SourceId;
    /** structural prior used in confidence scoring (0..1) */
    readonly baseConfidence: number;
    /** cheap check: is this format present on the page? */
    detect(ctx: ParseContext): boolean;
    /** extract 0..n raw candidates */
    extract(ctx: ParseContext): AdapterResult;
}
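
To make the contract concrete, here is a minimal sketch of an adapter that honors the "never throw" rule. Everything except the SourceAdapter interface is an assumption made for illustration: the Diagnostic, AdapterResult, and ParseContext shapes, the jsonLdBlocks field, the 0.9 prior, and the trimmed-down RawRecipe. It also only recognizes a top-level Recipe object, which real pages rarely offer.

```typescript
// Supporting types are assumptions; the article only shows SourceAdapter itself.
type SourceId = "json-ld" | "microdata" | "rdfa";
interface Diagnostic { source: SourceId; message: string }
interface RawRecipe { source: SourceId; name?: string; ingredients?: string[] }
interface AdapterResult { candidates: RawRecipe[]; diagnostics: Diagnostic[] }
// Assumed shape: the context already holds the text of each JSON-LD <script>.
interface ParseContext { url: string; jsonLdBlocks: string[] }

interface SourceAdapter {
    readonly id: SourceId;
    readonly baseConfidence: number;
    detect(ctx: ParseContext): boolean;
    extract(ctx: ParseContext): AdapterResult;
}

const toyJsonLdAdapter: SourceAdapter = {
    id: "json-ld",
    baseConfidence: 0.9, // illustrative value only
    detect(ctx) {
        return ctx.jsonLdBlocks.length > 0;
    },
    extract(ctx) {
        const candidates: RawRecipe[] = [];
        const diagnostics: Diagnostic[] = [];
        ctx.jsonLdBlocks.forEach((text, i) => {
            let parsed: unknown;
            try {
                parsed = JSON.parse(text);
            } catch {
                // Malformed input becomes a diagnostic, never an exception.
                diagnostics.push({ source: "json-ld", message: `block ${i}: invalid JSON` });
                return;
            }
            if (typeof parsed !== "object" || parsed === null) return;
            const node = parsed as Record<string, unknown>;
            if (node["@type"] !== "Recipe") return;
            const raw: RawRecipe = { source: "json-ld" };
            const name = node["name"];
            if (typeof name === "string") raw.name = name;
            const ings = node["recipeIngredient"];
            if (Array.isArray(ings)) {
                raw.ingredients = ings.filter((x): x is string => typeof x === "string");
            }
            candidates.push(raw);
        });
        if (candidates.length === 0) {
            diagnostics.push({
                source: "json-ld",
                message: `${ctx.jsonLdBlocks.length} JSON-LD blocks, 0 Recipe nodes`,
            });
        }
        return { candidates, diagnostics };
    },
};
```

The try/catch here is not swallowing anything: the failure is converted into a diagnostic the caller can read, which is the difference between "no recipe" and "no recipe, and here is why".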

Every adapter — whether it read a JSON-LD blob or walked a tree of itemprop attributes — emits the same intermediate shape, RawRecipe. This is the whole point. RawRecipe is deliberately loose (“as extracted”); it’s the normalizers’ job to tighten it up later.

export interface RawRecipe {
    source: SourceId;
    url?: string;
    isMainEntity?: boolean;

    name?: string;
    ingredients?: string[];
    ingredientGroups?: RawIngredientGroup[];
    instructions?: RawInstruction[];

    prepTime?: string;
    cookTime?: string;
    totalTime?: string;
    yield?: string | number | string[];
    nutrition?: Record<string, string>;
    // ...author, images, rating, video, etc.
}
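
The time fields arrive as strings because schema.org expresses durations in ISO 8601 form ("PT1H30M"). A duration normalizer downstream of the seam might look roughly like the sketch below. The function name and the decision to return whole minutes are assumptions, and it deliberately handles only the day, hour, minute, and second designators. Anything else returns null so the caller can keep the raw string rather than guess.

```typescript
/**
 * Convert an ISO 8601 duration such as "PT1H30M" to whole minutes.
 * Returns null when the string is not in that form; the caller keeps the raw text.
 * Months, years, and weeks are intentionally unsupported in this sketch.
 */
function isoDurationToMinutes(raw: string): number | null {
    const pattern = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/i;
    const match = pattern.exec(raw.trim());
    if (match === null) return null;

    const [days, hours, minutes, seconds] = match.slice(1, 5);
    // A bare "P" or "PT" matches the pattern but carries no information.
    if (days === undefined && hours === undefined && minutes === undefined && seconds === undefined) {
        return null;
    }

    const n = (s: string | undefined): number => (s === undefined ? 0 : Number(s));
    return Math.round(n(days) * 1440 + n(hours) * 60 + n(minutes) + n(seconds) / 60);
}

// isoDurationToMinutes("PT1H30M")    -> 90
// isoDurationToMinutes("45 minutes") -> null (not ISO 8601, so left for the raw fallback)
```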

The pipeline itself stays boring on purpose. Run each adapter that detects its format, collect candidates, assemble the best one from each, and field-merge them highest-confidence-first:

for (const adapter of adapters) {
    if (!adapter.detect(ctx)) continue;
    const result = adapter.extract(ctx);
    diagnostics.push(...result.diagnostics);
    if (result.candidates.length > 0) {
        present.push({ adapter, candidates: result.candidates });
    }
}

present.sort((a, b) => b.adapter.baseConfidence - a.adapter.baseConfidence);
const assembled = present
    .map((p) => assembleRecipe(p.candidates[0]!, url, p.adapter.baseConfidence))
    .filter((r): r is Recipe => r !== null);
const recipe = assembled.length > 0 ? mergeRecipes(assembled) : null;

Notice what’s not here: no if (site === "..."), no try/catch swallowing failures into a silent null. Adapters surface what went wrong as diagnostics, so a page that returns no recipe can tell you why (“N JSON-LD blocks, 0 Recipe nodes”) instead of just shrugging.
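
The article does not show mergeRecipes, so the following is only a guess at its simplest possible form: given recipes already sorted highest-confidence-first, each field is taken from the first recipe that has it. MiniRecipe, firstDefined, and mergeMini are hypothetical names, and the real merge may well weigh per-field confidence rather than adapter order alone.

```typescript
type SourceId = string;
interface Field<T> { value: T; confidence: number; source: SourceId; raw?: string }

// A three-field stand-in for the canonical Recipe (assumption).
interface MiniRecipe {
    name?: Field<string>;
    totalMinutes?: Field<number>;
    ingredients?: Field<string[]>;
}

function firstDefined<T>(values: (T | undefined)[]): T | undefined {
    return values.find((v) => v !== undefined);
}

/** `sorted` must already be ordered highest-confidence-first. */
function mergeMini(sorted: MiniRecipe[]): MiniRecipe {
    const merged: MiniRecipe = {};
    const name = firstDefined(sorted.map((r) => r.name));
    if (name !== undefined) merged.name = name;
    const totalMinutes = firstDefined(sorted.map((r) => r.totalMinutes));
    if (totalMinutes !== undefined) merged.totalMinutes = totalMinutes;
    const ingredients = firstDefined(sorted.map((r) => r.ingredients));
    if (ingredients !== undefined) merged.ingredients = ingredients;
    return merged;
}

const fromJsonLd: MiniRecipe = {
    name: { value: "Lentil Soup", confidence: 0.9, source: "json-ld" },
};
const fromMicrodata: MiniRecipe = {
    name: { value: "Lentil soup!", confidence: 0.7, source: "microdata" },
    totalMinutes: { value: 45, confidence: 0.7, source: "microdata", raw: "PT45M" },
};
const merged = mergeMini([fromJsonLd, fromMicrodata]);
// merged.name comes from json-ld; merged.totalMinutes is filled in from microdata.
```

Because every Field keeps its own source, the merged recipe still records which adapter supplied each value.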

JSON-LD doesn’t hand you a recipe

JSON-LD looks the friendliest of the three formats and is the most quietly hostile. A real-world block isn’t one tidy Recipe object — it’s a @graph array of loosely connected nodes (WebPage, Organization, Person, Recipe, ImageObject…) that point at each other by @id. The recipe’s author is often just { "@id": "#person-1" }, a bare reference to a node defined elsewhere in the graph. Read it naively and your author is the string "#person-1".

So before mapping anything, the JSON-LD adapter flattens every <script> block into one node pool (expanding @graph as it goes), indexes by @id, and resolves the bare references — with a cycle guard and a depth cap, because these graphs do contain loops:

const id = value["@id"];
if (typeof id === "string" && isPureRef(value) && idx.has(id) && !seen.has(id)) {
    const target = idx.get(id)!;
    const next = new Set(seen);
    next.add(id);
    return resolveRefs(target, idx, next, depth + 1);
}
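
For context, here is one way the fragment above could sit inside a complete function. This is a self-contained sketch, not the engine's actual code: the Json types, buildIndex, the isPureRef definition (an object whose only key is "@id"), and the MAX_DEPTH value of 16 are all assumptions.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonNode = { [key: string]: Json };

const MAX_DEPTH = 16; // assumed cap; the article does not state the real value

function isNode(v: Json | undefined): v is JsonNode {
    return typeof v === "object" && v !== null && !Array.isArray(v);
}

/** A pure reference carries "@id" and nothing else, e.g. { "@id": "#person-1" }. */
function isPureRef(node: JsonNode): boolean {
    const keys = Object.keys(node);
    return keys.length === 1 && keys[0] === "@id";
}

/** Flatten every parsed block (expanding @graph) and index full nodes by @id. */
function buildIndex(blocks: Json[]): Map<string, JsonNode> {
    const idx = new Map<string, JsonNode>();
    for (const block of blocks) {
        const roots = Array.isArray(block) ? block : [block];
        for (const root of roots) {
            if (!isNode(root)) continue;
            const graph = root["@graph"];
            const nodes = Array.isArray(graph) ? graph : [root];
            for (const node of nodes) {
                if (!isNode(node)) continue;
                const id = node["@id"];
                if (typeof id === "string" && !isPureRef(node)) idx.set(id, node);
            }
        }
    }
    return idx;
}

function resolveRefs(
    value: Json,
    idx: Map<string, JsonNode>,
    seen: Set<string> = new Set(),
    depth = 0,
): Json {
    if (depth > MAX_DEPTH) return value; // depth cap: stop expanding, keep what we have
    if (Array.isArray(value)) return value.map((v) => resolveRefs(v, idx, seen, depth + 1));
    if (!isNode(value)) return value;

    const id = value["@id"];
    if (typeof id === "string" && isPureRef(value) && idx.has(id) && !seen.has(id)) {
        const target = idx.get(id)!;
        const next = new Set(seen); // cycle guard is per-path, so siblings stay independent
        next.add(id);
        return resolveRefs(target, idx, next, depth + 1);
    }

    const out: JsonNode = {};
    for (const [key, child] of Object.entries(value)) {
        out[key] = resolveRefs(child, idx, seen, depth + 1);
    }
    return out;
}
```

A reference that has already been followed on the current path is left as a bare { "@id": ... } object instead of being expanded again, which is what keeps a self-referencing graph from recursing forever.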

Only after the graph is whole does the adapter go looking for Recipe nodes — and a page can have more than one, so it prefers the node whose URL matches the page you’re actually on, then falls back to completeness, then document order. This is the kind of detail the “just read the structured data” pitch never mentions.
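
That three-level preference can be sketched as a single pass over the candidates. The names below (RecipeNode, stripUrl, completeness, pickRecipeNode) are hypothetical, URL matching is reduced to "same string once the query, fragment, and trailing slash are removed", and completeness is a crude count of populated fields. The real heuristics are presumably more careful.

```typescript
interface RecipeNode {
    url?: string;
    name?: string;
    recipeIngredient?: string[];
    recipeInstructions?: unknown[];
}

/** Drop query string, fragment, and trailing slashes before comparing. */
function stripUrl(u: string): string {
    return u.replace(/[?#].*$/, "").replace(/\/+$/, "");
}

/** Crude completeness score: one point per populated core field. */
function completeness(node: RecipeNode): number {
    let score = 0;
    if (node.name !== undefined && node.name.trim() !== "") score++;
    if (node.recipeIngredient !== undefined && node.recipeIngredient.length > 0) score++;
    if (node.recipeInstructions !== undefined && node.recipeInstructions.length > 0) score++;
    return score;
}

/** Prefer a URL match, then the most complete node, then the earliest in the document. */
function pickRecipeNode(nodes: RecipeNode[], pageUrl: string): RecipeNode | null {
    const page = stripUrl(pageUrl);
    let best: { node: RecipeNode; matches: boolean; score: number } | null = null;
    for (const node of nodes) {
        const cand = {
            node,
            matches: node.url !== undefined && stripUrl(node.url) === page,
            score: completeness(node),
        };
        // Strict ">" means ties keep the earlier node, i.e. document order.
        if (
            best === null ||
            (cand.matches && !best.matches) ||
            (cand.matches === best.matches && cand.score > best.score)
        ) {
            best = cand;
        }
    }
    return best === null ? null : best.node;
}
```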

Every value carries its own receipts

The canonical Recipe the engine produces doesn’t store plain values. Every field is wrapped so its value, trust, and provenance travel together:

export interface Field<T> {
    value: T;
    confidence: number; // 0..1
    source: SourceId;
    raw?: string; // original string before normalization
}

That confidence is not decoration. It’s how the reader UI decides whether to trust a parsed ingredient line or fall back to rendering the original raw text. And keeping raw around is a hard rule: the parser is allowed to fail to understand a line, but it is never allowed to lose it. A missing field beats a wrong field. (That principle gets its own article — the engine is built to measure bad data, not heroically guess at fixing it.)
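
As an illustration of that fallback, a reader view could choose what to render per ingredient along these lines. ParsedIngredient, TRUST_THRESHOLD, and displayIngredient are made-up names, and the 0.6 cutoff is an arbitrary placeholder rather than the extension's real setting.

```typescript
type SourceId = string;
interface Field<T> { value: T; confidence: number; source: SourceId; raw?: string }

// Hypothetical parsed form of one ingredient line.
interface ParsedIngredient { quantity?: number; unit?: string; item: string }

const TRUST_THRESHOLD = 0.6; // assumed cutoff for illustration

function displayIngredient(field: Field<ParsedIngredient>): string {
    // Low confidence and the original text is available: show the original.
    if (field.confidence < TRUST_THRESHOLD && field.raw !== undefined) {
        return field.raw;
    }
    const { quantity, unit, item } = field.value;
    return [quantity, unit, item].filter((part) => part !== undefined).join(" ");
}

const trusted: Field<ParsedIngredient> = {
    value: { quantity: 2, unit: "cups", item: "flour" },
    confidence: 0.9,
    source: "json-ld",
    raw: "2 cups flour",
};
const shaky: Field<ParsedIngredient> = {
    value: { item: "heaping cups flour, sifted" },
    confidence: 0.3,
    source: "microdata",
    raw: "2 heaping cups flour, sifted",
};
// displayIngredient(trusted) -> "2 cups flour"
// displayIngredient(shaky)   -> "2 heaping cups flour, sifted"
```

The second case is the rule in action: the parser misread the line, but because raw was kept, the reader still sees exactly what the site published.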

The payoff

Here’s the moment that told me the architecture was right. The whole engine was built and tested against JSON-LD. Then I added the second adapter — microdata, a completely different beast that walks itemprop attributes through the DOM per the WHATWG value algorithm. It emits RawRecipe. That’s all it had to do.

Zero pipeline changes. Zero new normalizers. And the microdata path immediately produced nutrition, yield, timing, and author — fields the old extension’s microdata parser used to silently drop — because all of that lives once, downstream of the seam, shared by every adapter. The second adapter was the test of the abstraction, and it passed by being almost boring to add.
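
For a sense of what "walks itemprop attributes" involves, the WHATWG microdata rules pick a property's value from a different place depending on the element. The sketch below covers the simple cases over a toy MiniElement type, an assumption that avoids needing a DOM. It skips two parts of the real algorithm: nested items (elements with itemscope) and resolving relative URLs to absolute ones.

```typescript
// Toy stand-in for a DOM element (assumption, so the sketch runs without a browser).
interface MiniElement {
    tag: string;
    attrs: { [name: string]: string | undefined };
    text: string;
}

/** Simplified microdata property value lookup for one itemprop element. */
function itempropValue(el: MiniElement): string {
    const attr = (name: string): string => el.attrs[name] ?? "";
    switch (el.tag.toLowerCase()) {
        case "meta":
            return attr("content");
        case "audio": case "embed": case "iframe": case "img":
        case "source": case "track": case "video":
            return attr("src");
        case "a": case "area": case "link":
            return attr("href");
        case "object":
            return attr("data");
        case "data": case "meter":
            return attr("value");
        case "time":
            // Prefer the machine-readable datetime attribute when present.
            return el.attrs["datetime"] ?? el.text;
        default:
            return el.text;
    }
}

// <meta itemprop="cookTime" content="PT30M">  -> "PT30M"
// <span itemprop="name">Lentil Soup</span>    -> "Lentil Soup"
```

Once values come out as plain strings like these, they can be dropped into a RawRecipe and handed to the same normalizers the JSON-LD path uses.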

That’s the foundation. The next article is about what happens when the data you extract is technically valid and still garbage — and why the engine’s job is to score that, not silently “correct” it.