Throwaway by Design: Building a Regression Verifier with Agents

I help product teams build quality software and lead engineering efforts. Currently working at OpenSpace as a Senior Software Engineer.
Last month I built a regression-testing harness for a single library upgrade — React Query v4 → v5 in one of the front-end apps I'm working on. It's about 800 lines of code across four packages. It runs once. It gets deleted after the migration ships. Nobody on the team is going to maintain it, and nobody is meant to.
A year ago, writing this by hand would have been an obvious waste of my time. Today, with agents doing all the writing, the math is different.
This post is about that shift: what I built, what it caught, and why a tool I'm about to delete was worth building at all.
The bugs we couldn't economically test
React Query is the data-fetching layer for our entire app and one of the most important dependencies of the system. The major version we were upgrading to removes the state change callbacks from queries, renames core APIs for data caching, and requires the object form of useQuery.
In the codebase, approximately 350 files import React Query, and over 1,000 usages required migration. Manual testing across that surface area is nearly impossible — or at least, highly impractical.
The usual safety nets handle what they're designed for. TypeScript catches signature breaks. Unit tests catch logic. Cypress covers our critical flows. After the migration, all of that still passed. CI was green.
What CI didn't catch is the class of bugs I actually had to worry about — symptoms that show up specifically for this kind of migration:
Duplicate fetches that didn't happen in v4
Errors v4 silently swallowed, now surfacing
Paint regressions — multi-second timing drift on routes that "still work"
These aren't unassertable. You could write a test that says "this page makes exactly one call to /api/user." You could write a test that measures paint time and fails if it exceeds a budget. The verifier I built does essentially that.
The problem was never "can we detect this." The problem was: who owns those assertions?
A route-by-route check on request counts is tied tightly to the current implementation. Pin it down today and it breaks the next time someone refactors a hook. Pay the maintenance cost for years for a check that earns its keep once, during a migration that ships in a week or two — and even then, only for the pitfalls you already knew to look for. You can't really write all the assertions you'd need. That's a bad trade. Nobody writes those tests, and nobody should.
Before agents, this left a real gap: a class of regressions I knew were possible, in a place where catching them cost more than letting them slip. With agents, the calculation changes.
I didn't need a new kind of check. I needed a way to run those checks without anyone owning the code that runs them.
What I built
Here's what I came up with. The pipeline ended up as four phases — small components, each doing one job, each writing its output to a file the next phase reads. No orchestrator, no shared memory, no clever process management. Just files, with a single Markdown report at the end.
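Here's the whole thing end to end, from reading the codebase to the final report:
[Diagram: Cartographer (writes plan.yaml) → Runner (writes artifacts/v4 and artifacts/v5) → Diff engine (writes diffs.json) → Triage agent (writes the Markdown report)]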
Let's walk through, left to right.
Cartographer. This is the part that figures out what to test. It reads the app's routing config, counts React Query hooks per route, checks which routes Cypress covers, and notices which components touch removed v4 APIs. Together, those four signals tell us how risky each part of the app is to migrate.
The cartographer ranks routes by that risk and writes them out as plan.yaml. Each route gets a baseline sequence of steps for the next phase to execute — navigate, wait for the page to settle, capture artifacts. For the highest-risk routes, the cartographer also drafts interaction flows: clicking a link, opening a modal, etc., picked from the components those routes render.
Behind the scenes, the cartographer is a 150-line Markdown prompt that the agent runs against the app's repo.
A trimmed entry looks like this:
routes:
  - path: /dashboard
    risk: high
    risk_reasons: [removed_onSuccess, no_cypress_coverage]
    capture: [screenshot, a11y, network, console, rq_cache, timing]
    flows:
      - name: open-first-item
        steps:
          - { action: click, target: 'role=link >> nth=0' }
          - { action: wait_for_idle }
          - { action: capture }
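To make the shape of that entry concrete, here is a minimal TypeScript sketch of how a runner might type a plan entry and expand it into step sequences. The type names, the `navigate` action, the `medium` and `low` risk values, and the `expandRoute` helper are all assumptions for illustration; only the fields in the trimmed example come from the real plan.

```typescript
// Field names follow the trimmed plan.yaml entry; everything else is hypothetical.
type CaptureKind = "screenshot" | "a11y" | "network" | "console" | "rq_cache" | "timing";

type Step =
  | { action: "navigate"; path: string } // assumed name for the baseline "open the page" step
  | { action: "click"; target: string }
  | { action: "wait_for_idle" }
  | { action: "capture" };

interface Flow {
  name: string;
  steps: Step[];
}

interface RoutePlan {
  path: string;
  risk: "high" | "medium" | "low"; // only "high" appears in the example
  risk_reasons: string[];
  capture: CaptureKind[];
  flows?: Flow[]; // drafted only for the highest-risk routes
}

// Baseline every route gets: navigate, wait for the page to settle, capture.
function baselineFlow(route: RoutePlan): Flow {
  return {
    name: "baseline",
    steps: [
      { action: "navigate", path: route.path },
      { action: "wait_for_idle" },
      { action: "capture" },
    ],
  };
}

// Returns the baseline followed by any interaction flows drafted for the route.
function expandRoute(route: RoutePlan): Flow[] {
  return [baselineFlow(route)].concat(route.flows ?? []);
}
```

With the `/dashboard` entry above, `expandRoute` would yield two sequences: the baseline and `open-first-item`.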
Runner. Now we actually drive the browser. The runner is Playwright walking through the plan one step at a time: open the page, click the thing, wait for everything to settle, then capture. This step is fully deterministic and programmatic, so there's no per-run model cost, and covering more routes costs only machine time.
Two things make the capture trustworthy. Every flow runs in its own fresh browser context, and both the clock and the viewport are frozen. That freshness isn't optional: cache eviction semantics changed between v4 and v5, so if I reused a context across flows I'd be comparing two different starting states and calling the difference a regression.
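A rough sketch of that isolation, under stated assumptions: the three interfaces below mirror only the handful of Playwright calls the sketch relies on (`browser.newContext`, `context.newPage`, `page.clock.setFixedTime`, `page.goto`, `context.close`), so a real Playwright `Browser` should fit where `BrowserLike` is expected. The viewport size, the frozen timestamp, and the `runFlowIsolated` name are hypothetical.

```typescript
// Minimal structural types covering only the Playwright calls used here.
interface PageLike {
  clock: { setFixedTime(time: number | string | Date): Promise<void> };
  goto(url: string): Promise<unknown>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(options: { viewport: { width: number; height: number } }): Promise<ContextLike>;
}

// Hypothetical fixed values; the real ones are not stated in this post.
const VIEWPORT = { width: 1280, height: 720 };
const FROZEN_TIME = "2024-01-01T00:00:00Z";

// Runs one flow in a brand-new context so no cookies, storage or cached state
// carry over from a previous flow.
async function runFlowIsolated<T>(
  browser: BrowserLike,
  url: string,
  flow: (page: PageLike) => Promise<T>,
): Promise<T> {
  const context = await browser.newContext({ viewport: VIEWPORT });
  try {
    const page = await context.newPage();
    await page.clock.setFixedTime(FROZEN_TIME); // Date.now() in the page returns a constant
    await page.goto(url);
    return await flow(page);
  } finally {
    await context.close(); // always discard the context, even if the flow throws
  }
}
```

Creating and closing the context inside one function keeps the "fresh start" guarantee in a single place, so no flow can accidentally inherit state.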
When a capture step fires, six recorders snapshot the world at that moment:
a full-page screenshot
the accessibility tree
a network log — JSONL with shape-hashed response bodies, so I can diff structure without diffing values
a console log
a snapshot of the React Query cache, lifted through window.__queryClient (one dev-only line in the app)
timings from the Performance API
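The "shape-hashed response bodies" idea from the network log can be sketched as follows. This is an illustration, not the verifier's actual recorder: the `shapeOf` and `shapeHash` names, the way arrays are collapsed, and the small FNV-1a-style hash over UTF-16 code units are all assumptions.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Reduces a JSON value to a description of its structure: key names and value
// types, never the values themselves.
function shapeOf(value: Json): string {
  if (value === null) return "null";
  if (Array.isArray(value)) {
    // Collapse an array to the sorted set of its distinct element shapes,
    // so list length and element order do not change the result.
    const shapes = Array.from(new Set(value.map(shapeOf))).sort();
    return `[${shapes.join("|")}]`;
  }
  if (typeof value === "object") {
    const fields = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${shapeOf(value[key] as Json)}`);
    return `{${fields.join(",")}}`;
  }
  return typeof value; // "string", "number" or "boolean"
}

// Small non-cryptographic hash, used only to keep each log line short.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

function shapeHash(body: Json): string {
  return fnv1a(shapeOf(body));
}
```

Two responses with the same fields and types hash identically even when every value differs, so a changed hash between v4 and v5 points at a structural change rather than fresh data.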
Then I run the whole thing twice — once against v4, once against v5. Same plan, same data. I end up with two directories of artifacts, ready to be diffed.
Diff engine. With two sets of artifacts on disk, the rest is mechanical. This phase does one thing: compare. It walks artifacts/v4 and artifacts/v5, pairs them up by route, flow, and step, and runs a set of comparators side by side. There's one comparator per artifact type from the previous round — each one knows how to read its own kind of capture and say what changed:
network: normalises URLs (strips timestamps and IDs), groups by method + path, and flags new, missing, or duplicated requests
cache: matches by stringified query key
console: does a set difference of error messages
timing: flags any step where v5 is at least 20% or 500ms slower
screenshot: runs through pixelmatch with anti-alias tolerance, and only fires if more than 0.5% of pixels changed
a11y: compares the accessibility trees and flags nodes that appeared or disappeared
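Two of the comparators listed above are small enough to sketch in full. The function names are hypothetical, and reading "20% or 500ms" as "either threshold trips the check" is an assumption about how the real timing comparator combines the two limits.

```typescript
// Timing comparator: true when v5 is slower than v4 by at least the relative
// threshold OR the absolute one (assumed interpretation of "20% or 500ms").
function isSlower(v4Ms: number, v5Ms: number, relative = 0.2, absoluteMs = 500): boolean {
  const delta = v5Ms - v4Ms;
  if (delta >= absoluteMs) return true;
  return v4Ms > 0 && delta / v4Ms >= relative;
}

// Console comparator: error messages that appear in v5 but never appeared in v4.
function newConsoleErrors(v4Errors: string[], v5Errors: string[]): string[] {
  const seenInV4 = new Set(v4Errors);
  return Array.from(new Set(v5Errors)).filter((message) => !seenInV4.has(message));
}
```

Both functions are pure: the same two artifact files always produce the same answer, which is what makes this phase safe to re-run after tightening a threshold.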
The output is a flat diffs.json — one record per change. A record looks like this:
{
  "route": "/dashboard",
  "step": "step-02-capture",
  "kind": "network",
  "subkind": "duplicated_request",
  "details": {
    "key": "GET /api/user",
    "v4": 1,
    "v5": 2
  }
}
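For illustration, a network comparator that emits the `kind`, `subkind`, and `details` part of such a record could look like the sketch below. The input shape, the ID-matching regex, the `new_request` and `missing_request` names, and the rule that "duplicated" means "more calls in v5 than in v4" are assumptions. Only `duplicated_request` and the `details` layout come from the record above, and the caller would still attach `route` and `step`.

```typescript
interface RequestEntry {
  method: string;
  url: string;
}

interface NetworkDiff {
  kind: "network";
  subkind: "new_request" | "missing_request" | "duplicated_request";
  details: { key: string; v4: number; v5: number };
}

// Path segments treated as IDs: plain integers and UUIDs (assumed convention).
const ID_SEGMENT = /^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/i;

// Drops origin, query string and fragment, and masks ID-like path segments.
function normaliseKey(entry: RequestEntry): string {
  const withoutOrigin = entry.url.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]+/i, "");
  const path = withoutOrigin.split(/[?#]/)[0] ?? "";
  const segments = path.split("/").map((s) => (ID_SEGMENT.test(s) ? ":id" : s));
  return `${entry.method.toUpperCase()} ${segments.join("/")}`;
}

function countByKey(entries: RequestEntry[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const key = normaliseKey(entry);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

function diffNetwork(v4: RequestEntry[], v5: RequestEntry[]): NetworkDiff[] {
  const before = countByKey(v4);
  const after = countByKey(v5);
  const allKeys = Array.from(before.keys()).concat(Array.from(after.keys()));
  const keys = Array.from(new Set(allKeys)).sort();
  const diffs: NetworkDiff[] = [];
  for (const key of keys) {
    const n4 = before.get(key) ?? 0;
    const n5 = after.get(key) ?? 0;
    const details = { key, v4: n4, v5: n5 };
    if (n4 === 0) diffs.push({ kind: "network", subkind: "new_request", details });
    else if (n5 === 0) diffs.push({ kind: "network", subkind: "missing_request", details });
    else if (n5 > n4) diffs.push({ kind: "network", subkind: "duplicated_request", details });
  }
  return diffs;
}
```

Grouping on the normalised key is what lets one call with `?ts=1` in v4 line up against two calls with different timestamps in v5.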
Again, zero LLM in this phase. It answers what changed, not whether it matters — and that's what the next and final step is for.
Triage agent. Now for the judgement call. A list of raw differences isn't an answer yet — most of them are perfectly expected, a few are noise, and a handful are the bugs I actually care about. Telling those apart is judgement work — the kind of thing that genuinely needs a brain. This is, again, where AI comes in. The triage agent reads diffs.json and labels every record as expected, noise, or likely-regression, with a one-line rationale for each.
The prompt includes a short v4 → v5 changelog primer — what changed, what to wave through as expected, what to look at twice. I pass it the complete diffs.json and ask it to flag entries. Relatively cheap per run, even on thousands of records.
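Because the labels come back from a model, a cheap deterministic check on the way out is useful. A minimal sketch, assuming the agent returns one verdict per record keyed by its position in diffs.json; the `index` field and both helper names are hypothetical, and only the three labels and the one-line rationale come from the setup described here.

```typescript
type TriageLabel = "expected" | "noise" | "likely-regression";

interface TriageVerdict {
  index: number; // position of the record in diffs.json (assumed field)
  label: TriageLabel;
  rationale: string; // the one-line explanation
}

const LABELS: string[] = ["expected", "noise", "likely-regression"];

// Returns a list of problems; an empty list means every record received
// exactly one verdict with a known label and a non-empty rationale.
function checkVerdicts(recordCount: number, verdicts: TriageVerdict[]): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const verdict of verdicts) {
    const inRange =
      Math.floor(verdict.index) === verdict.index &&
      verdict.index >= 0 &&
      verdict.index < recordCount;
    if (!inRange) problems.push(`verdict for unknown record ${verdict.index}`);
    else if (seen.has(verdict.index)) problems.push(`record ${verdict.index} labelled twice`);
    else seen.add(verdict.index);
    if (LABELS.indexOf(verdict.label) === -1) {
      problems.push(`record ${verdict.index} has unknown label`);
    }
    if (verdict.rationale.trim() === "") {
      problems.push(`record ${verdict.index} has no rationale`);
    }
  }
  for (let i = 0; i < recordCount; i++) {
    if (!seen.has(i)) problems.push(`record ${i} has no verdict`);
  }
  return problems;
}

// Counts verdicts per label for the summary at the top of the report.
function tally(verdicts: TriageVerdict[]): Record<TriageLabel, number> {
  const counts: Record<TriageLabel, number> = { expected: 0, noise: 0, "likely-regression": 0 };
  for (const verdict of verdicts) counts[verdict.label] += 1;
  return counts;
}
```

A non-empty problem list is a signal to re-run only the triage step, which is cheap because the diff records are already on disk.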
And because each step writes its result to disk, I can re-run any piece on its own. Want to re-triage with a sharper primer? Run the last step. Want to add a flow? Re-run the runner for that one route. Want to tighten the screenshot threshold? Just the diff engine. It's all just files.
AI where it earns its keep
If you take one thing from this post, it's this: don't reach for AI everywhere. Reach for it where the work is judgement, not mechanics.
LLM at the edges. Look back at the four phases and notice where the model actually shows up: the first step and the last. Codebase exploration is a set of judgement calls. Ranking routes by risk is a judgement call. Deciding whether a duplicated request is a real regression or just an expected v5 quirk is a judgement call. All of them are fuzzy, all of them need context, and all of them are happy to be done in one big batch. That's exactly the kind of work an LLM is good at.
Deterministic in the middle. Everything between those two ends is mechanical. Driving Playwright through a script, walking two directory trees, comparing JSON — none of that needs to think. An LLM could do it, sure, and you'd gain a little flexibility — but this work is relatively simple and repetitive. You'd just pay more and trade a guaranteed answer for a probabilistic one.
Here's how one finding moves through the pipeline:
The runner counts requests on the page (deterministic).
The diff engine notices v4 made one call, v5 made two (deterministic).
The triage agent reads that record alongside the changelog and says "canonical keepPreviousData regression pattern; likely-regression" (LLM).
The most time-consuming parts — running the app across every route, capturing six artifacts at each step — never touch the model at all. That's what keeps the cost of re-running the whole thing close to nothing.
Plan once → Run mechanically → Judge once.
That split is the whole difference between a pipeline I can re-run on a whim and one that's a non-starter. Put a model in the middle and every re-run pays it to repeat the testing, capturing, and diffing across the whole app. At our scale, that alone would sink it.
What it caught
I smoke tested it on a small subset of the app first, adjusting the setup as I went, and once the results looked good enough I pointed it at the full application.
It produced 248 paired diff records, filtered 191 of them away as expected or noise, and flagged 57 as likely regressions.
The flagged findings landed in exactly the categories I'd set out to catch.
Duplicated requests. A global-config endpoint went from two calls in v4 to three in v5 on a couple of the busier routes, and a user endpoint went from one call to two on a handful of others.
Silent errors. A TypeError: Failed to fetch started surfacing across roughly a dozen routes. Same code path, same network behaviour, just no longer quietly absorbed.
Paint regressions. Three routes came back noticeably slower in v5, all of them by around three seconds.
And then a category I hadn't even thought to look for: two of the less-trafficked routes stopped rendering altogether. A shared layout's error boundary tripped in v5 but not v4, taking down everything underneath it. I never put "blank screen" on my list of things to check — the pipeline caught it anyway, because the runner records console errors and the diff engine noticed brand-new ones in v5. That's the part I didn't expect: it surfaces more than what you specifically aim it at.
What I really liked, though, was that the output was readable. Here's the triage agent's rationale on one of the duplicate-request findings:
Canonical keepPreviousData regression pattern: when placeholderData is wired up wrong, the query can re-fetch on every render.
No prompting from me to get that. It read a structured diff record, matched it against the v5 changelog, and wrote a plain-English explanation of why I should care. I can act on that sentence. I can't do much with a bare "v4: 1, v5: 2."
And that's really the whole point of the exercise. The goal was never to catch every bug there is — it was to surface the regressions I had reason to suspect, before they shipped. On that, it delivered.
This code isn't written to be read
The verifier's code doesn't matter. The app's code does. The one and only job of the first is to protect the second.
I never read most of the code that does the work. I didn't read the runner. I didn't read the diff engine. What I actually read was the prompt behind the cartographer, the architecture sketch I committed before any code existed, and the test reports — enough to know whether the implementation was heading in the right direction. That was the whole of my review.
I'm comfortable with that, as this thing runs once. Maintaining it, or even reviewing it line by line, would cost more than the tool is worth.
What I do have to trust isn't the code — it's the process. Is the four-phase split the right one? Is the AI making judgement calls in the places that actually need judgement? Are the results directionally correct, and do they suggest the approach is sound and genuinely useful? Those are the only questions worth the attention.
None of this would hold true if the tool had to stick around. If someone needed to maintain it six months from now, we'd need a different approach — one with a thorough review and code I'd actually stand behind.
Takeaways
AI brainstorming gets me to ideas I wouldn't have reached alone. The whole four-phase split started as a brainstorming session, and the opening prompt was nothing fancy:
"Let's brainstorm a current industry standard way of regression testing with AI agents. Use your search tool to find any relevant resources, videos, and other material that would be applicable to our situation. I'm upgrading a data fetching library on a large-scale web application and I need an automated way to regression test it. I will need to come up with a good test plan or a prompt that will gather the test plan and then for what's most important I will need to do verification that I didn't introduce any regressions."
That's it. No architecture, no plan, no shape — just the problem and a rough sense of the scale. The split that runs through this whole post came out of the conversation that followed, shaped around the constraints of our system. I'm fairly sure I wouldn't have landed on it alone, and I definitely didn't write it down first and then build it. I steered; I checked that the direction was sound and the output had signal; the agent worked out the details.
The signal is what I'm after, not the code that produces it. I was never trying to catch every bug. I wanted to catch some bugs I'd have no realistic way of finding by hand — exactly the class nobody is ever going to write permanent tests for. Once that's the goal, everything downstream changes: how closely I read the output, what "good enough" means, and how little I care about the quality of the code underneath.
These are the real takeaways. I'll delete this verifier once the migration ships, and the next person with a problem like it won't reuse my code — they'll spin up their own throwaway, same shape, different details.
Summary
A few things worth noting:
Some classes of bugs are detectable in principle but not worth testing the conventional way. Migration regressions are a perfect example.
Agents unlock single-purpose, throwaway software. You can build a one-off harness, run it, and throw it away — the cost of asserting something is no longer the cost of owning that assertion forever.
Split the work by what each side is good at: judgement at the edges, mechanics in the middle. That's what keeps re-runs cheap enough to actually re-run.
The artifact is disposable; the signal is the deliverable. What you have to trust is the process, not the code.
Agents make disposable software economic, and that's the whole point. So the next time you hit a migration, a one-off audit, or a question your existing tools just aren't the right shape for, ask yourself whether it's worth building something you'll throw away. A year ago the answer would have been no. Today it's likely yes.
If you enjoyed the article or have a question, feel free to reach out on Bluesky! 👋