FOR MANUAL TESTERS

Opening the hood

AI now lets a manual tester read the code against the acceptance criteria, without having to be a developer. This page is about what that changes, and what it doesn't.

The car in the garage

Imagine taking your car for inspection.

The mechanic comes back an hour later.

“Your car is good. I drove it. It works.”

You'd say: did you even open the hood?

That's a drive without an inspection. Driving the car is not wrong. A good mechanic always does both, and the drive is what proves the car actually moves.

What's new is this: AI now lets the manual tester do the inspection too, before the car leaves the garage.

[Illustration: a detective in a fedora with a magnifying glass, inspecting under the hood of a car in a dim garage.]

One thing to get out of the way

Reading the code doesn't replace testing, and it isn't the opposite of manual testing. They are different activities that belong next to each other. Running the feature tells you whether it works. Reading the code tells you whether the developer actually built what was asked. You want both, and until now the second half was usually a developer's job.

What a manual tester could never do before

Reading a diff used to be a developer skill. You had to know the codebase, the framework, the idioms. If you were a manual tester, the code was a locked box; you were handed a feature and asked whether it worked.

That boundary just moved. An LLM reads the diff, reads the acceptance criteria, and tells you in plain language which ACs have matching code, which don't, and where the edges of the change are. You don't need to parse TypeScript or trace React state. You need to know what the feature is supposed to do, and you already do.

That's the whole contribution. Not a new category of testing. A new pathway into something developers have been doing at code review for decades.

What a drive alone can't see

A passing manual test is not the same thing as a correct implementation. It only proves the paths you thought to walk.

CODE MATCHES THE SPEC

  • Every acceptance criterion has matching code.
  • Edge cases and error paths appear to be handled.
  • Your test drive is a confirmation rather than a discovery.

THE HAPPY PATH LOOKS RIGHT

  • Your manual pass succeeds.
  • But AC-3 is silently not implemented at all.
  • It ships. The bug arrives six weeks later from production.

Without reading the code, the two cases look the same from the driver's seat. Once you have the report, they don't.

Meet the agent that reads the diff

The functional-reviewer answers one question: does this diff actually implement the ACs?

It reads the acceptance criteria and the unified git diff of the branch, then runs five checks. Read them as things the agent looks for, not things it can prove:

coverage

For each AC, is there code in the diff that appears to address it? Or is nothing in the diff pointing at it?

correctness

Does the code look like it does what the AC describes, or does it contradict the spec (wrong keys, inverted conditions, reversed defaults)?

edge cases

Nulls, boundaries, permissions, locale, races. Which ones does the code visibly handle, and which does it silently skip?

side effects

What did the diff touch that the AC never mentioned? The agent sees changes inside the diff; regressions that only show up at runtime still need the drive.

completeness

Validations, error handling, failure paths. Are both happy and sad paths represented in the diff?

WHAT THE AGENT IS AND IS NOT

Reading the code is not running it. When the agent says an AC has "matching code," that means it looks addressed in the diff, not that the behaviour has been confirmed. The drive still has to prove the feature works.

The output is a single report: a verdict (approve, approve with conditions, or request changes), a table showing which ACs have code, and a list of findings pointing at specific files and line ranges.
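That structure can be sketched as data. This is a minimal sketch with hypothetical field names, not the agent's actual schema; the real output is a formatted report, and the verdict rule shown is just one plausible way coverage could drive it:

```typescript
// Hypothetical shape of the review report. Field names are illustrative.
type Coverage = "code present" | "partial" | "no code found";

interface ACRow {
  id: string;               // e.g. "AC-3"
  criterion: string;        // the acceptance criterion text
  coverage: Coverage;       // what the diff appears to show
  reference: string | null; // a file:line claim -- verify it, don't quote it
}

// One plausible rule for how coverage alone could drive the verdict:
// any AC with no matching code forces "request changes".
function verdictFrom(rows: ACRow[]): "approve" | "approve with conditions" | "request changes" {
  if (rows.some((r) => r.coverage === "no code found")) return "request changes";
  if (rows.some((r) => r.coverage === "partial")) return "approve with conditions";
  return "approve";
}
```

The point of the sketch is the reference field: it's a claim about the diff, carried along so you can check it.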

What the output looks like

Here's an excerpt from a review of a dark-mode feature. Five ACs, two with no matching code in the diff. Not visible from the driver's seat, trivial to spot once someone reads the code.

# Functional Review Report
Ticket: #42 (Add dark mode toggle)
Risk: 6/10 (rough signal, not a calibrated score)

## AC Coverage

| #    | Acceptance Criterion                          | In diff?       | Code reference           |
|------|-----------------------------------------------|----------------|--------------------------|
| AC-1 | Toggle visible in nav header on every page    | Code present   | Header.tsx:10            |
| AC-2 | Clicking switches mode immediately            | Code present   | DarkModeToggle.tsx:7-11  |
| AC-3 | Choice persists across sessions (localStorage)| No code found  | (none)                   |
| AC-4 | data-testid="dark-mode-toggle"                | Code present   | DarkModeToggle.tsx:16    |
| AC-5 | Default matches prefers-color-scheme          | No code found  | (none)                   |

## Findings

- AC-3: No localStorage read/write anywhere in the diff; theme state will reset on reload.
- AC-5: useState(false) hardcodes initial value; no prefers-color-scheme check on first visit.

Verdict: Request changes. Two of five ACs have no matching code.
("Code present" means the code looks like it addresses the AC. It does
not mean the behaviour has been confirmed. The drive does that.)

A manual tester can drive this car all day and the happy path passes: the toggle clicks, the theme flips. AC-3 only surfaces if you happen to close the tab and reopen it. AC-5 only fails on a machine with a system preference you don't use. The report caught both without touching the browser. The drive still has to confirm that AC-1, AC-2, and AC-4 behave the way the code suggests.
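For concreteness, here is roughly what the two missing pieces would have to look like. This is an illustrative sketch, not the ticket's actual code: storage is abstracted behind an interface so the logic reads (and runs) outside a browser, where it would be window.localStorage, and where systemPrefersDark would come from matchMedia("(prefers-color-scheme: dark)"):

```typescript
type Theme = "dark" | "light";

// Abstracted so the logic is testable outside a browser;
// in the app this would be window.localStorage.
interface ThemeStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const THEME_KEY = "theme"; // illustrative key name

// AC-5: with no saved choice, the default follows the system preference.
// AC-3: a previously saved choice wins over the system preference.
function initialTheme(store: ThemeStore, systemPrefersDark: boolean): Theme {
  const saved = store.getItem(THEME_KEY);
  if (saved === "dark" || saved === "light") return saved;
  return systemPrefersDark ? "dark" : "light";
}

// AC-3: persist every explicit choice so it survives a reload.
function saveTheme(store: ThemeStore, theme: Theme): void {
  store.setItem(THEME_KEY, theme);
}
```

Neither function is exotic; that's the point. The gap the report found isn't hard code, it's absent code.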

What the agent can't do

If you stop reading here, you will overrate what the agent is doing. A fair picture has to name four places it can fail you.

1. Garbage AC in, garbage verdict out

The whole pipeline assumes clean, unambiguous, testable acceptance criteria. Most real backlogs do not have those. An AC that says "the feature should feel snappy" gives the agent nothing concrete to check code against. The quality of the review is capped by the quality of the spec. If an AC is vague, fix the AC before trusting the report.
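To see the difference, compare a vague AC with one concrete enough to become an assertion. Both ACs here are made up for illustration:

```typescript
// Vague: "the feature should feel snappy" -- nothing in that sentence
// maps to code in a diff or to a check anyone can run.

// Concrete: "search results render within 300 ms of the last keystroke" --
// this maps directly to an assertion a validator (or a human) can run.
function assertSnappy(renderTimeMs: number, budgetMs = 300): void {
  if (renderTimeMs > budgetMs) {
    throw new Error(`Rendered in ${renderTimeMs} ms, over the ${budgetMs} ms budget`);
  }
}
```

The rewrite is the tester's job, not the agent's: the agent can only match a diff against what the AC actually says.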

2. LLMs hallucinate file references

Language models fabricate paths and line numbers with confidence. A report that says DarkModeToggle.tsx:16 is only trustworthy if the agent actually looked at the file and the reference is checked against the real diff before being shown. Treat any citation the agent gives you as a claim to verify, not a fact to quote. Its job is to narrow your search, not to end it.
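The cheapest such verification is mechanical: before reading a cited line, check that the cited file is even part of the diff. A minimal sketch, assuming standard unified-diff output; the function names are illustrative:

```typescript
// Collect the file paths a unified diff actually touches.
// Git names the new file on lines like "+++ b/src/Header.tsx".
function filesInDiff(diff: string): Set<string> {
  const files = new Set<string>();
  for (const line of diff.split("\n")) {
    const m = line.match(/^\+\+\+ b\/(.+)$/);
    if (m) files.add(m[1]);
  }
  return files;
}

// A citation like "DarkModeToggle.tsx:16" is only worth reading further
// if its file appears in the diff at all.
function citationFileExists(citation: string, diff: string): boolean {
  const file = citation.split(":")[0];
  return [...filesInDiff(diff)].some((f) => f === file || f.endsWith("/" + file));
}
```

This only filters out the worst fabrications; a citation that survives it still needs your eyes on the actual lines.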

3. The risk score is a rough signal

A number like "6/10" looks like a measurement. It isn't. There's no explicit rubric behind it. Read it as a rough ordering between tickets, not as a calibrated probability. If it ever gets used to gate merges, the methodology needs to be written down first, or it will erode trust the first time a low score ships a bad bug.

4. Nothing checks the reviewer

In this pipeline every agent feeds forward. Nothing currently re-reads the reviewer's verdict with a different model or a different prompt to push back on it. Until that exists, you are the check. Read each "approve" as one reviewer's proposal, not as a consensus.

You still drive the car

The report tells you what the code claims to do. Only running it, either yourself or through the browser-validator, shows you what it does when a real browser, a real network, and a real user hit it. Reading the code narrows the search; running it is still where the behaviour gets confirmed.

THE TWO HALVES

Reading the code is the inspection. Running it is the drive. You need both. One tells you whether the engine was built according to the spec; the other tells you whether the car actually moves on the road.

The win is that you now walk into the drive knowing which ACs are suspect, which edges to probe, and where the developer's attention ran out. Your manual time goes where it's most valuable: on the scenarios no amount of reading the diff can prove on its own, including the regressions and integration surprises that only show up at runtime.

Nothing here replaces your judgment

Every agent in this workflow is a specialist working under you, not above you. The reviewer proposes a verdict; you decide whether its findings are fair and push back when they aren't. The test scenario designer drafts cases; you prune what doesn't matter for this release and add what it missed.

Even the browser-validator, the agent built to do the driving, is not a box you hand the keys to and walk away from. Think of it as autonomous driving: the car can steer, but you are still in the car. You watch the dashboard, see what it opened, which elements it clicked, which assertions it ran. You read the report it produced, you ask why it skipped a step, and you intervene when its interpretation of a scenario is wrong.

THE HUMAN STAYS IN THE LOOP

Agents propose. You decide. A verdict that says "approve" is an input to your judgment, not a substitute for it. The value of this workflow is that it hands you better material to judge with, not that it removes you from the judging.

A manual tester's new day

Before the change:

  1. Open the ticket.
  2. Open the app.
  3. Try to cover every AC from the outside, with no idea which ones the developer actually touched.
  4. File bugs when something breaks.

After the change:

  1. Open the ticket.
  2. Read a one-page report: three ACs have matching code, one looks partial, one has no code at all. Two edge cases flagged.
  3. Sanity-check the file references the agent cited.
  4. Open the app. Spend your time on the partial AC, the missing one, the flagged edges, and the regressions the diff can't prove on its own.
  5. File bugs, starting from the agent's references and confirmed by you.

Less busywork, more judgment work. That's the shift.

Where this fits in the workflow

Here's the full pipeline. Each agent does one thing and hands its output to the next:

environment-manager        check out the branch, start the app
    ↓
functional-reviewer        read the diff, compare it to the ACs
test-scenario-designer     draft the scenarios to run
    ↓
browser-validator          drive the car, verify against the UI
    ↓
bug-reporter               turn findings into developer-ready tickets

The inspection comes first. The drive comes second. The report comes third. That's the order a good mechanic would use, and now the order a good QA team can use too.
