Collecting a Dataset Without Collecting People

A note before I start: this article reflects my current thinking and the conservative design I’ve shipped against a mock endpoint. I’m not a lawyer, none of this is legal advice, and the project’s licensing and data terms are a decision I’m still working through before anything goes live. I’m writing it up because the engineering of “collect anonymously, by construction” is interesting regardless of where the legal lines finally land.

The whole series so far has been about one thing: extracting a clean, scored, trustworthy Recipe from a hostile page. That capability has a second life. If the extension can parse a recipe well on your machine, it can contribute that parse to a shared dataset — which can be cleaned up, cached at the edge, and served back to everyone, so the next person who hits barefootcontessa’s broken ingredient blob gets the already-repaired version instantly.

The dataset is the product. Which raises the question this article answers: how do you collect it without collecting people?

Two kinds of content, one bright line

The key realization is that a recipe page contains two legally and ethically different kinds of content, and they should be treated completely differently.

There are the functional facts: the ingredient list, the steps, the times, the yield, the nutrition. In US copyright terms these are the procedure, and 17 U.S.C. §102(b) excludes procedures, processes, and systems from copyright protection. A bare list of ingredients and the functional directions to combine them generally aren’t copyrightable. (This is well-trodden ground; it’s part of why every recipe site looks the same and why aggregators exist.)

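Then there’s the creative content: the photography, the headnote where the author tells you about their grandmother’s kitchen, the video. That is copyrightable, full stop. The author byline gets the same treatment: a name on its own isn’t copyrightable, but it identifies a person, which is exactly what this design avoids transmitting.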

So the bright line writes itself:

Functional facts + the source URL → can go to the server (the URL doubles as attribution and the cache key).
Creative content → stays on your device, used only to render your reader view, never transmitted.

This is roughly how recipe APIs like Edamam operate at the conservative end: traffic in facts, attribute and link back, leave the creative work alone.

The code is the policy

The nice thing about a bright line is you can encode it as a pure function, and then the policy is enforced by construction rather than by remembering to be careful. buildPayload reduces a full canonical Recipe to the transmittable slice:

export function buildPayload(recipe: Recipe, url: string): TelemetryPayload {
    const stepGroups = recipe.stepGroups.value.map((g) => ({
        ...(g.heading ? { heading: g.heading } : {}),
        steps: g.steps.map((s) => s.text),
    }));
    return {
        url: recipe.source.url || url,
        domain: recipe.source.domain,
        name: recipe.name.value,
        ingredientGroups: recipe.ingredientGroups.value,
        stepGroups,
        ...(recipe.prepTime ? { prepTime: toDuration(recipe.prepTime.value) } : {}),
        ...(recipe.cookTime ? { cookTime: toDuration(recipe.cookTime.value) } : {}),
        ...(recipe.totalTime ? { totalTime: toDuration(recipe.totalTime.value) } : {}),
        ...(recipe.yield ? { yield: recipe.yield.value } : {}),
        ...(recipe.nutrition ? { nutrition: recipe.nutrition.value } : {}),
    };
}

Look at what’s not there. No images. No description. No author. No video, no rating. The function doesn’t strip them defensively — it never reaches for them in the first place. The creative fields exist on the Recipe (the reader renders them locally), they just never enter the payload’s shape. The TelemetryPayload type itself doesn’t have those fields, so adding them later would be a deliberate, reviewable act, not an accident.
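
To make the "the type itself doesn’t have those fields" point concrete, here is a minimal sketch of what such a payload type could look like, with a compile-time guard on top. The field types, the CreativeKey union, and the guard are my assumptions for illustration; the article does not show the real definitions.

```typescript
// Hypothetical shapes: the real type definitions are not shown in the article.
interface IngredientGroup {
    heading?: string;
    ingredients: string[];
}

interface StepGroup {
    heading?: string;
    steps: string[];
}

// Only functional facts plus the source URL. There is deliberately no
// images, description, author, video, or rating field.
interface TelemetryPayload {
    url: string;
    domain: string;
    name: string;
    ingredientGroups: IngredientGroup[];
    stepGroups: StepGroup[];
    prepTime?: string; // assumed to be an ISO 8601 duration such as "PT15M"
    cookTime?: string;
    totalTime?: string;
    yield?: string;
    nutrition?: Record<string, string>;
}

type CreativeKey = "images" | "description" | "author" | "video" | "rating";

// Evaluates to `true` only while no creative key is a key of TelemetryPayload.
type HasNoCreativeKeys =
    [Extract<keyof TelemetryPayload, CreativeKey>] extends [never] ? true : false;

// Stops compiling (`true` is not assignable to `false`) if a creative field is added.
const payloadHasNoCreativeKeys: HasNoCreativeKeys = true;

const example: TelemetryPayload = {
    url: "https://page.test/cake",
    domain: "page.test",
    name: "Cake",
    ingredientGroups: [{ ingredients: ["2 cups flour"] }],
    stepGroups: [{ steps: ["Mix.", "Bake."] }],
};
```

With a guard like this, adding an author field to the type fails the build before any test runs, which is one way to keep the change deliberate and reviewable.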

This is the kind of invariant worth testing, so a test pins it down rather than trusting the implementation:

it("drops copyrightable creative content (photos, description, author)", () => {
    const p = buildPayload(makeRecipe(), "https://page.test/cake") as Record<string, unknown>;
    expect(p.images).toBeUndefined();
    expect(p.description).toBeUndefined();
    expect(p.author).toBeUndefined();
    expect(p.video).toBeUndefined();
    expect(p.rating).toBeUndefined();
});
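
The test above checks five known-bad keys. A stricter variant, sketched here under the assumption that the allow-list mirrors the keys buildPayload returns, would flag any key that is not explicitly permitted. ALLOWED_KEYS and unexpectedKeys are hypothetical names, not part of the project.

```typescript
// Assumption: this allow-list mirrors the keys buildPayload returns in the article.
const ALLOWED_KEYS: ReadonlySet<string> = new Set([
    "url", "domain", "name", "ingredientGroups", "stepGroups",
    "prepTime", "cookTime", "totalTime", "yield", "nutrition",
]);

// Returns every key that is not on the allow-list.
// An empty array means the payload carries nothing unexpected.
function unexpectedKeys(payload: object): string[] {
    return Object.keys(payload).filter((key) => !ALLOWED_KEYS.has(key));
}

// How it might read inside the existing suite:
// expect(unexpectedKeys(buildPayload(makeRecipe(), "https://page.test/cake"))).toEqual([]);
```

The deny-list test catches the fields you thought of; an allow-list check also catches the ones you didn’t.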

Anonymous means no identifier exists

“Anonymous” gets used loosely. Here it means something precise: the payload carries no user or install identifier of any kind. Not a hashed one, not a rotating one — none. There’s no userId, no installId, no clientId, no sessionId. The server can’t link two parses to the same person because there is nothing to link them by.

The one ID in the system is the eventId, and it’s deliberately event-scoped, not user-scoped — a fresh crypto.randomUUID() per parse, whose only purpose is to let the server dedup resends of the same event (remember, at-least-once delivery means events can arrive twice). It identifies a delivery, not a person. There’s a test for that too, asserting the absence of the obvious identifier keys:

it("carries no user/install identifier (anonymity)", () => {
    const keys = Object.keys(buildPayload(makeRecipe(), "https://page.test/cake"));
    for (const id of ["userId", "installId", "clientId", "id", "sessionId"]) {
        expect(keys).not.toContain(id);
    }
});

This buys a genuinely nice property: anonymity-by-construction means there’s nothing to recall. “Delete my data” is just a local clear — the queue and any cached recipes on your device — because the server data was never attributable to you in the first place. That’s not a gap in a data-subject-rights story; it’s the strongest possible version of one. You can’t be forgotten by a system that never knew who you were. (IP gets stripped at ingestion too, so the transport layer doesn’t quietly reintroduce identity.)
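
Here is a rough sketch of how an event-scoped ID can support deduplication without identifying anyone. The TelemetryEvent envelope, makeEvent, and EventDeduper are hypothetical names of mine, and the in-memory set stands in for whatever the real server would use.

```typescript
// Hypothetical envelope: the article names eventId but does not show this type.
interface TelemetryEvent<P> {
    eventId: string; // identifies one delivery, never a user or an install
    payload: P;
}

// The ID source is injected so the sketch runs anywhere; in the extension it
// would be something like () => crypto.randomUUID().
function makeEvent<P>(payload: P, newId: () => string): TelemetryEvent<P> {
    return { eventId: newId(), payload };
}

// Server-side stand-in: an in-memory set of event IDs already accepted.
// A real service would presumably use a store with expiry instead.
class EventDeduper {
    private readonly seen = new Set<string>();

    // Returns true the first time an eventId arrives, false on a resend.
    accept(event: TelemetryEvent<unknown>): boolean {
        if (this.seen.has(event.eventId)) return false;
        this.seen.add(event.eventId);
        return true;
    }
}
```

Because every parse gets a fresh ID, two parses of the same page by the same person look exactly like two parses by strangers; only a literal resend of one event is recognized.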

Consent, without a wall

Collecting anonymously is necessary but not sufficient. People should know, and be able to say no. But I also didn’t want a modal wall between you and the thing you installed. The model I landed on:

Default on, with a first-run disclosure in the popup — one plain sentence of what’s collected, a privacy link, and a Continue button. No dark patterns, no toggle buried in the card. The popup is the natural place because you have to open it to use the reader, so it’s the moment right before the first possible parse.
A real opt-out in settings. Off means off.
Collection is suppressed until the disclosure has actually been shown — so even a future auto-open path that never touches the popup can’t collect before you’ve seen the notice.

Those conditions collapse into one gate that both the content script and the background consult:

export async function isCollectionEnabled(): Promise<boolean> {
    const [settings, onboarding] = await Promise.all([
        settingsItem.getValue(),
        onboardingItem.getValue(),
    ]);
    return settings.telemetryEnabled && onboarding.disclosureSeen;
}

Two flags, both required: you haven’t opted out and you’ve seen the disclosure. On a fresh install disclosureSeen is false, so the honest default is that nothing is collected until you’ve been told it would be:

it("defaults to false on a fresh install (disclosure not yet seen)", async () => {
    expect(await isCollectionEnabled()).toBe(false);
});

Notice the disclosure flag lives in a separate onboarding storage item, not in user settings. It’s not a preference you tune; it’s a one-way record that the notice was shown, so it doesn’t belong in the same bag as your toggles. And opt-out doesn’t just stop future collection; it clears the pending queue immediately, so anything queued a minute ago never leaves once you decline.
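
The two behaviors described here, a one-way disclosure flag and an opt-out that empties the queue, can be sketched with a small in-memory model. CollectionState and its methods are hypothetical, and the sketch is synchronous for brevity, whereas real extension storage is asynchronous.

```typescript
// Synchronous in-memory stand-in for the extension's storage items
// (hypothetical; the real storage calls are asynchronous).
class CollectionState {
    private telemetryEnabled = true; // default on
    private disclosureSeen = false;  // fresh install: notice not yet shown
    private queue: string[] = [];    // pending serialized events

    // One-way: there is deliberately no method that sets this back to false.
    markDisclosureSeen(): void {
        this.disclosureSeen = true;
    }

    setTelemetryEnabled(enabled: boolean): void {
        this.telemetryEnabled = enabled;
        if (!enabled) this.queue = []; // opting out drops anything still pending
    }

    isCollectionEnabled(): boolean {
        return this.telemetryEnabled && this.disclosureSeen;
    }

    // Returns true if the event was queued, false if the gate rejected it.
    enqueue(event: string): boolean {
        if (!this.isCollectionEnabled()) return false;
        this.queue.push(event);
        return true;
    }

    pendingCount(): number {
        return this.queue.length;
    }
}
```

Keeping the queue clear inside the same setter that flips the flag means there is no window in which the setting says "off" while events are still waiting to be sent.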

The content script ties it together: gated, and fire-and-forget from the caller’s side, so it never delays the reader:

const reportParse = async (result: ExtractionResult) => {
    if (!result.recipe) return;
    if (!(await isCollectionEnabled())) return;   // the gate
    await sendMessage("recordParse", buildPayload(result.recipe, location.href));
};
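
Since reportParse awaits internally, the fire-and-forget part has to happen where it is called. A minimal sketch of such a call site follows; fireAndForget and renderReader are hypothetical names, and the helper assumes the task is an async function.

```typescript
// Hypothetical helper: starts an async task without awaiting it and routes
// any rejection to onError, so a failed report cannot surface as an
// unhandled rejection or hold up rendering.
function fireAndForget(
    task: () => Promise<void>,
    onError: (err: unknown) => void = () => {},
): void {
    void task().catch(onError);
}

// Call-site sketch (reportParse is from the article; renderReader is assumed):
// renderReader(result);
// fireAndForget(() => reportParse(result));
```

The helper returns as soon as the task reaches its first await, so the reader view never waits on storage reads or messaging.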

The honest edges

A few things I want to state plainly rather than gloss:

The reader works whether or not you contribute. Collection is never a condition of use. The toggle changes what the dataset gets, never what you get.
The defensibility rests on the anonymity being real. The whole argument — facts aren’t copyrightable, there’s no one to attribute data to — only holds because the payload genuinely contains no identity and only functional facts. The moment that stops being true, the reasoning collapses. Which is exactly why those two properties are pinned by tests.
This is the conservative end, and it’s still pre-launch. The endpoint is currently a mock that logs instead of POSTing. Before any real collection there’s a privacy policy to write, a Chrome data-disclosure form to file, and — as I said up top — a licensing and data-terms decision I haven’t finished making. I’d rather ship the mechanism correctly and settle the policy deliberately than rush both.

That’s the series: a format-agnostic engine, a scorer that measures instead of guesses, a reader that occludes instead of fights, tests that shaped the architecture, the production plumbing WXT leaves to you, and a dataset built without building a profile of anyone. The fun part — the server that cleans and serves it back — is still ahead.
