Testing as a First-Class Citizen

Most of the design decisions in this rebuild were made with testability in mind, rather than having tests bolted on afterward. That’s the thesis of this article: testing isn’t a phase at the end, it’s a constraint you let push back on your architecture from the start. When you do, the code you end up with is the code that’s easy to test: pure where it can be, with side effects pushed to the edges.

Let me walk up the testing pyramid, because this project has a genuine one: a wide base of fast unit tests, a middle layer of golden-master snapshots, and a thin top of slow, real-browser e2e.

The base: pure functions, tested directly

The widest, fastest layer tests pure functions with no I/O. The extractor is almost entirely this — every normalizer is a function from input to output with no clock, no network, no DOM-of-record. Testing them is boring in the best way:

it("handles unicode and mixed fractions", () => {
    expect(parseIngredient("1 ¼ cups sugar").amount).toEqual({ min: 1.25, max: 1.25 });
    expect(parseIngredient("½ teaspoon salt").amount).toEqual({ min: 0.5, max: 0.5 });
    expect(parseIngredient("1 1/2 tablespoons oil").amount).toEqual({ min: 1.5, max: 1.5 });
});
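To make "a function from input to output" concrete, here is a minimal sketch of how the quantity at the front of an ingredient line could be parsed. It is not the project's parseIngredient, which the article doesn't show; parseLeadingAmount, parseToken, and the fraction table are names invented for illustration, and the sketch assumes the whole number and the fraction are separated by whitespace.

```typescript
// Illustrative sketch only; the project's real parseIngredient is not shown in the article.
type Amount = { min: number; max: number };
type Token = { value: number; kind: "whole" | "decimal" | "fraction" };

// Unicode vulgar fractions and their numeric values.
const UNICODE_FRACTIONS = new Map<string, number>([
    ["¼", 0.25], ["½", 0.5], ["¾", 0.75],
    ["⅓", 1 / 3], ["⅔", 2 / 3],
    ["⅛", 0.125], ["⅜", 0.375], ["⅝", 0.625], ["⅞", 0.875],
]);

// Classifies one whitespace-separated token, or returns null if it is not numeric.
function parseToken(token: string): Token | null {
    const unicode = UNICODE_FRACTIONS.get(token);
    if (unicode !== undefined) return { value: unicode, kind: "fraction" };

    const ascii = /^(\d+)\/(\d+)$/.exec(token);
    if (ascii !== null) {
        const denominator = Number(ascii[2]);
        return denominator === 0
            ? null
            : { value: Number(ascii[1]) / denominator, kind: "fraction" };
    }
    if (/^\d+$/.test(token)) return { value: Number(token), kind: "whole" };
    if (/^\d+\.\d+$/.test(token)) return { value: Number(token), kind: "decimal" };
    return null;
}

// Parses the quantity at the start of an ingredient line.
// Returns null when the line does not start with a number.
function parseLeadingAmount(line: string): Amount | null {
    const [first, second] = line.trim().split(/\s+/);
    const head = first === undefined ? null : parseToken(first);
    if (head === null) return null;

    let total = head.value;
    // A whole number may be followed by exactly one fraction ("1 ¼", "1 1/2").
    if (head.kind === "whole" && second !== undefined) {
        const tail = parseToken(second);
        if (tail !== null && tail.kind === "fraction") total += tail.value;
    }
    return { min: total, max: total };
}
```

Nothing here reads a clock, a network, or a page, which is why the tests for this layer can be one-line input/output assertions.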

The interesting move is on the extension side, where you’d expect everything to be entangled with browser state. The telemetry queue — the thing that survives service-worker restarts and drives network egress — has its core modeled as a pure reducer:

export function queueReducer(state: QueueState, action: QueueAction): QueueState { ... }

No storage, no clock, no fetch. Every time-dependent input (the “now” for expiry and backoff) is passed in as a parameter. That one discipline means the entire ring-buffer-overflow, dedup, backoff, and expiry logic is testable as plain data-in/data-out, with zero mocking:

it("drops oldest on overflow and counts the drops (no silent caps)", () => {
    let s: QueueState = EMPTY_QUEUE_STATE;
    for (let i = 0; i < QUEUE_MAX_EVENTS + 5; i++) {
        s = queueReducer(s, { type: "enqueue", event: ev(String(i), `u${i}`) });
    }
    expect(s.events).toHaveLength(QUEUE_MAX_EVENTS);
    expect(s.dropped).toBe(5);
    expect(s.events[0]?.eventId).toBe("5");
});

There’s even a test that asserts the reducer’s purity directly — same input, same output, no mutation of the argument:

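it("is a pure reducer: same input, same output, no mutation", () => {
    const before = JSON.parse(JSON.stringify(EMPTY_QUEUE_STATE));
    const action: QueueAction = { type: "enqueue", event: ev("1", "u1") };
    const first = queueReducer(EMPTY_QUEUE_STATE, action);
    const second = queueReducer(EMPTY_QUEUE_STATE, action);
    expect(second).toEqual(first);              // same input, same output
    expect(EMPTY_QUEUE_STATE).toEqual(before);  // the argument was not mutated
});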

This is the reducer pattern from Redux/Elm, used here not for UI state but for durable state — and the payoff is the same: the hard logic is a pure function you can hammer with cases instantly.
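For readers who haven't met the pattern outside UI code, here is a rough sketch of a reducer shaped like the one described above. The real queueReducer body is elided in this article, so everything below is an assumption: the state fields, the tiny MAX_EVENTS cap, the "expire" action, and the choice to count expired events separately from overflow drops are all invented for illustration.

```typescript
// Hypothetical shapes; the article's real QueueState and QueueAction are not shown in full.
type TelemetryEvent = { eventId: string; createdAt: number };
type QueueState = {
    events: readonly TelemetryEvent[];
    dropped: number; // events lost to overflow
    expired: number; // events removed for being too old
};
type QueueAction =
    | { type: "enqueue"; event: TelemetryEvent }
    | { type: "expire"; now: number }; // "now" is data on the action, never read from a clock

const MAX_EVENTS = 3;     // assumed tiny cap so the overflow path is easy to see
const MAX_AGE_MS = 60000; // assumed expiry window

const EMPTY: QueueState = { events: [], dropped: 0, expired: 0 };

function reduce(state: QueueState, action: QueueAction): QueueState {
    switch (action.type) {
        case "enqueue": {
            // Dedup: an event id that is already queued leaves the state untouched.
            if (state.events.some((e) => e.eventId === action.event.eventId)) return state;
            const events = [...state.events, action.event];
            // Ring-buffer overflow: drop from the front and count what was dropped.
            const overflow = Math.max(0, events.length - MAX_EVENTS);
            return { ...state, events: events.slice(overflow), dropped: state.dropped + overflow };
        }
        case "expire": {
            const events = state.events.filter((e) => action.now - e.createdAt <= MAX_AGE_MS);
            return { ...state, events, expired: state.expired + (state.events.length - events.length) };
        }
    }
}
```

Every branch builds a new object and leaves its argument alone, so a test can reuse one starting state across many cases without resetting anything.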

One step up: I/O, faked at the boundary

Some logic genuinely is about side effects — the flusher’s whole job is to talk to storage and the network. You can’t make that pure, but you can make its dependencies injectable and fake them.

Two tools do the work. wxt/testing ships a fakeBrowser — an in-memory implementation of the extension storage APIs, so storage.defineItem works in a plain Vitest process. And flush takes its fetch as a parameter (defaulting to the real one) so a test can hand it a mock:

it("sends a batch and removes events only on 2xx", async () => {
    await dispatch({ type: "enqueue", event: ev("1", "u1") });
    const fetchMock = vi.fn().mockResolvedValue({ ok: true, status: 200 });
    const sent = await flush(fetchMock as unknown as typeof fetch, NOW);
    expect(sent).toBe(true);
    const state = await queueItem.getValue();
    expect(state.events).toHaveLength(0); // removed only because 2xx
});

it("retains events and backs off on a non-2xx response", async () => {
    await dispatch({ type: "enqueue", event: ev("1", "u1") });
    const fetchMock = vi.fn().mockResolvedValue({ ok: false, status: 500 });
    await flush(fetchMock as unknown as typeof fetch, NOW);
    const state = await queueItem.getValue();
    expect(state.events).toHaveLength(1); // kept — at-least-once
    expect(state.consecutiveFailures).toBe(1);
});

These tests pin the reliability contract — at-least-once delivery, remove-only-on-2xx, backoff after failure — without ever touching a real network. The consent gate gets the same treatment: seed fakeBrowser storage into each state, assert the gate:

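it("is true only when enabled AND disclosed", async () => {
    await setState(true, true);
    expect(await isCollectionEnabled()).toBe(true);
    for (const [a, b] of [[true, false], [false, true], [false, false]] as const) {
        await setState(a, b); // any missing half of the consent must close the gate
        expect(await isCollectionEnabled()).toBe(false);
    }
});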
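The injectable-dependency shape behind the flush tests can be sketched in a few lines. This is not the project's flush: the real one takes fetch and reads WXT storage, while this sketch swaps in a hand-rolled QueueStore interface and a simplified HttpPost type so it runs with no browser or DOM types. The backoff constants, the doubling schedule, and the single-writer assumption are mine.

```typescript
// Hypothetical sketch of dependency injection at the I/O boundary.
type HttpResponse = { ok: boolean; status: number };
type HttpPost = (url: string, body: string) => Promise<HttpResponse>;

type Queue = { events: string[]; consecutiveFailures: number; nextAttemptAt: number };

interface QueueStore {
    load(): Promise<Queue>;
    save(queue: Queue): Promise<void>;
}

const BASE_BACKOFF_MS = 1000;  // assumed
const MAX_BACKOFF_MS = 60000;  // assumed

// Exponential backoff: 1s, 2s, 4s, ... capped at MAX_BACKOFF_MS.
function backoffMs(failures: number): number {
    return Math.min(MAX_BACKOFF_MS, BASE_BACKOFF_MS * 2 ** (failures - 1));
}

// Sends the queued events. Returns true only if a batch was sent and acknowledged.
// Assumes a single writer: nothing else modifies the store while a flush is running.
async function flush(store: QueueStore, post: HttpPost, now: number, url: string): Promise<boolean> {
    const queue = await store.load();
    if (queue.events.length === 0 || now < queue.nextAttemptAt) return false;

    let ok = false;
    try {
        ok = (await post(url, JSON.stringify(queue.events))).ok;
    } catch {
        ok = false; // a thrown network error counts as a failed attempt
    }

    if (ok) {
        // Events are removed only after the server acknowledged them.
        await store.save({ events: [], consecutiveFailures: 0, nextAttemptAt: 0 });
        return true;
    }
    // Failure: keep every event (at-least-once) and schedule the next attempt.
    const failures = queue.consecutiveFailures + 1;
    await store.save({ ...queue, consecutiveFailures: failures, nextAttemptAt: now + backoffMs(failures) });
    return false;
}

// In-memory store for tests, playing the role fakeBrowser plays in the article.
function memoryStore(initial: Queue): QueueStore {
    let current = initial;
    return {
        load: async () => current,
        save: async (next) => { current = next; },
    };
}
```

Because the clock, the store, and the transport all arrive as arguments, a failure-then-retry scenario is three function calls with different "now" values and no timers to wait on.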

The middle: golden-master site snapshots

Here’s my favorite layer, because it’s doing double duty. For every real recipe fixture in the corpus, a test runs the full extraction and snapshots the entire result — canonical recipe, quality score, signals, diagnostics — to a checked-in JSON file:

describe("json-ld site snapshots", () => {
    for (const file of fixtures) {
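        it(file, async () => {
            // Fixture filename without its extension (assumes `parse` is imported from node:path).
            const base = parse(file).name;
            const doc = loadJsonLdFixture(join("jsonld", file));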
            const result = extractRecipe(doc, urlFor(file));
            const snapshot = JSON.stringify(result, null, 2);
            await expect(snapshot).toMatchFileSnapshot(
                join(here, "snapshots", "jsonld", `${base}.snap.json`),
            );
        });
    }
});

This is golden-master / characterization testing — Vitest’s toMatchFileSnapshot writes one file per site, and vitest -u re-records after an intentional change. As a regression guard it’s ruthless: change a normalizer and you see, in a reviewable diff, every site whose output moved — including the ones you didn’t mean to touch.
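If the record/compare/re-record cycle is unfamiliar, it can be modeled in a few lines. This is a toy, not how Vitest implements toMatchFileSnapshot: a Map stands in for the directory of snapshot files, and checkSnapshot with its outcome names is invented for illustration.

```typescript
// Toy model of golden-master testing; a Map stands in for the snapshot files on disk.
type SnapshotStore = Map<string, string>;
type SnapshotOutcome = "recorded" | "matched" | "changed" | "updated";

function checkSnapshot(
    store: SnapshotStore,
    name: string,
    value: unknown,
    update = false, // plays the role of running with -u
): SnapshotOutcome {
    // Pretty-printed JSON, so a change shows up as a small line-level diff.
    const serialized = JSON.stringify(value, null, 2);
    const previous = store.get(name);

    if (previous === undefined) {
        store.set(name, serialized); // first run: record the golden master
        return "recorded";
    }
    if (previous === serialized) return "matched";
    if (update) {
        store.set(name, serialized); // intentional change: re-record
        return "updated";
    }
    return "changed"; // unintentional change: the test should fail here
}
```

One thing the toy makes visible: JSON.stringify emits keys in insertion order, so the extraction has to build its result objects deterministically or the snapshots will churn without any real change.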

But it’s also a measurement instrument. These snapshots were the first cross-site quality data I had — they’re how I learned that 17 of 18 sites score above 0.95 and exactly one is broken. The test suite and the data-quality research are literally the same artifact. (That snapshot loop — add fixture, vitest -u, scan the score table, count patterns toward the rule of 3 — is the engine’s whole feedback cycle.)
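The "scan the score table" step is simple enough to sketch. The shape below is hypothetical: the article doesn't show where the quality score lives inside a snapshot, so SiteScore assumes it has already been pulled out into a flat record, and summarizeScores is a name made up for this example. The sample scores in the usage note are invented too.

```typescript
// Hypothetical flat view of one snapshot: site name plus its quality score.
type SiteScore = { site: string; score: number };
type ScoreSummary = { total: number; passing: number; failing: SiteScore[] };

// Splits sites into those scoring strictly above the threshold and the rest,
// listing the failing ones worst-first so the broken site is at the top.
function summarizeScores(scores: readonly SiteScore[], threshold: number): ScoreSummary {
    const failing = scores
        .filter((s) => s.score <= threshold) // filter returns a copy, so sorting is safe
        .sort((a, b) => a.score - b.score);
    return { total: scores.length, passing: scores.length - failing.length, failing };
}
```

Fed something like `[{ site: "a", score: 0.98 }, { site: "b", score: 0.41 }]` with a threshold of 0.95, it reports one passing site and lists "b" as failing.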

The top: real-browser e2e

Unit tests can’t tell me whether a shadow-root overlay actually paints above a max-z ad in a real renderer. For that you need a real browser, which is the slow, thin top of the pyramid.

Playwright loads the built MV3 extension into a persistent Chromium context. The trick that makes it work headless is channel: "chromium" — Playwright’s bundled new-headless Chromium supports extensions, so there’s no dependency on a system Chrome install:

const context = await chromium.launchPersistentContext("", {
    channel: "chromium",
    headless: true,
    args: [`--disable-extensions-except=${EXT_PATH}`, `--load-extension=${EXT_PATH}`],
});

The reader is triggered via the autoOpen setting, which the test seeds directly into the extension’s real storage (no fakeBrowser at this layer), rather than by driving the popup, which can’t target a specific tab. Then the specs assert the things only a renderer can confirm. The cleverest one is keepOnTop, which leans on a subtlety of how Playwright clicks: before clicking, it checks that the target is the element that would actually receive the event at the click point, and the click times out if something else is covering it. So “can I click the close button?” is the assertion that the reader is painting on top:

test("keepOnTop: reader stays clickable above a late max-z ad", async ({ context }) => {
    const page = await context.newPage();
    await page.goto("/ad-techniques/late-max-z.html");
    const host = page.locator("readable-recipes-root");
    await expect(host).toBeAttached();

    await page.waitForTimeout(700); // ad injects at 500ms, re-appends every 1.5s
    const close = host.getByRole("button", { name: /close/i });
    await close.click();              // succeeds only if the reader isn't covered
    await expect(host).toHaveCount(0);
});

The fixtures are synthetic hostile pages — one technique each (late-max-z, scroll-jacker, autoplay-media) — which doubles as a permanent regression guard for the ad fight and as manual smoke pages I can open by hand.
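Why a click works as a paint-order assertion is easier to see in a toy model. The sketch below is not Playwright's actionability check, which runs in the real renderer and handles stacking contexts, shadow roots, and scrolling; it only models flat rectangles with a z-index, where later siblings win ties. Box, topmostAt, and canClick are invented names, and re-appending the reader is just one plausible counter-move, not necessarily what keepOnTop does.

```typescript
// Toy hit-testing model: flat rectangles, one z-index each, list order = DOM order.
type Box = { id: string; x: number; y: number; width: number; height: number; zIndex: number };

// Returns the id of the box that would receive a click at (px, py).
// Highest zIndex wins; on a tie, the box later in the list paints on top.
function topmostAt(boxes: readonly Box[], px: number, py: number): string | null {
    let winner: Box | null = null;
    for (const box of boxes) {
        const inside =
            px >= box.x && px < box.x + box.width &&
            py >= box.y && py < box.y + box.height;
        if (inside && (winner === null || box.zIndex >= winner.zIndex)) winner = box;
    }
    return winner === null ? null : winner.id;
}

// A target is "clickable" in this model if it is the hit target at its own center.
function canClick(boxes: readonly Box[], targetId: string): boolean {
    const target = boxes.find((b) => b.id === targetId);
    if (target === undefined) return false;
    const cx = target.x + target.width / 2;
    const cy = target.y + target.height / 2;
    return topmostAt(boxes, cx, cy) === targetId;
}

const MAX_Z = 2147483647; // largest z-index browsers honor

const reader: Box = { id: "reader", x: 0, y: 0, width: 800, height: 600, zIndex: MAX_Z };
const lateAd: Box = { id: "ad", x: 0, y: 0, width: 800, height: 600, zIndex: MAX_Z };

// An ad appended after the reader at the same z-index covers it...
const adOnTop = canClick([reader, lateAd], "reader");
// ...and moving the reader back to the end of the list restores it.
const readerOnTop = canClick([lateAd, reader], "reader");
```

In this model adOnTop is false and readerOnTop is true, which is the same yes/no signal the e2e spec gets from a single click on the close button.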

What’s automatable, and what isn’t

A pyramid is also about being honest where the automation stops. Playwright drives Chromium and Firefox — but it can’t load Firefox extensions at all, so Firefox stays a manual pnpm dev:firefox smoke test. And the iOS-Safari concerns — does the background service worker wake on startup, is storage.local durable under memory pressure — can’t be touched by Playwright either; those go on a manual iOS-Simulator checklist. Writing that down matters: silent gaps in coverage read as “tested” when they aren’t.

Tying it off in CI

The layers map onto CI jobs that run in parallel — lint/typecheck, unit tests, and both-browser builds as independent fast jobs — with the Playwright e2e as its own gating job. The e2e job started life as allow-failure while I proved it stable; once it was green and reliable, it became a real gate, so a reader regression now blocks a merge instead of failing quietly.

None of this is exotic. The point is the ordering of causation: I didn’t write a reducer and then discover it was testable. I made the queue a reducer because I wanted to test the durable logic without a browser, made fetch injectable because I wanted to test reliability without a network, and snapshotted whole results because I wanted regressions and quality data from one artifact. Testability was an input to the design, and the architecture is better for it — which, conveniently, is the subject of the next article: everything WXT doesn’t do for you.