Your Tests Are Green. Your Prompts Are Broken.
I’m building a cheap and quick way to generate a perfect wiki for any codebase, using any LLM available (local, cheap, fancy, etc.). The goal is to generate flawless starting context for coding agents. I use a number of different methods of self-analysis, and I’ve found that the fuzzy nature of LLMs is both a blessing and a curse. In this post I’m sharing one approach I’ve found that works for assessing LLMs in real production code.
Building with LLMs and agents is genuinely fun for me. It feels like a “new” thing in my code, and it’s kinda weird.
I sketch out a prompt, run it a few times, get decent results, wire it into the system, and suddenly I’ve got AI doing something useful! It feels like magic. I expand, grow, test, build, then discover it’s generating a big old pile of poop when I use anything other than money-burning frontier models.
Sure, “evals” exist. I (well, Claude, if we’re being honest here) can build evaluation harnesses, curate datasets, and run benchmarks. But there’s a lazy way: just write tests. Regular, familiar tests that tell you when things break, help you improve iteratively, and let you benchmark different approaches without ceremony.
The catch? Normal tests can’t do this. They test the code, not the prompts.
The Mock Problem
When you mock an LLM call in a test, you’re basically saying “assume the LLM returns this perfectly formatted response.” That assumption usually holds for the chunky-boy models. Your test validates that you can parse JSON, that your business logic handles the data correctly, and that everything wires together. Brilliant. Essential, even.
But it doesn’t tell you:
- Does your prompt actually get the model to detect what it’s supposed to detect?
- Does it hallucinate data that aren’t there?
- Does it output valid JSON 100% of the time, or just in your carefully crafted mock?
- Does it work with 500-line files, or just the 20-line examples in your tests?
- How cheap and how fast can the model you choose be before the system degrades?
You’re testing the plumbing, not the water.
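For contrast, here’s roughly what one of those mocked tests looks like. This is a minimal sketch in Vitest style; runSecurityAgent and completeChat are hypothetical names standing in for your agent and the module that wraps the LLM call.

```typescript
// mocked.test.ts: the kind of test that stays green while the prompt rots (sketch)
import { expect, test, vi } from "vitest";
import { runSecurityAgent } from "./agents"; // hypothetical agent that calls ./llmClient internally

// Vitest hoists this mock: the "LLM" now always returns a perfectly formatted response.
vi.mock("./llmClient", () => ({
  completeChat: vi.fn().mockResolvedValue(
    JSON.stringify({ findings: [{ type: "sql_injection", confidence: 0.95 }] }),
  ),
}));

test("parses findings from the security agent", async () => {
  const analysis = await runSecurityAgent("const q = 'SELECT * FROM users WHERE id = ' + userId;");
  const parsed = JSON.parse(analysis);

  // Proves the plumbing works. Says nothing about whether the real prompt would catch this.
  expect(parsed.findings).toHaveLength(1);
});
```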
LLM-as-Judge
Use an LLM to evaluate your LLM outputs.
Instead of writing assertions like “the response must contain the word ‘SQL’”, you write assertions in natural language and let another model judge whether they’re true. Something like:
“The security analysis should identify the SQL injection vulnerability and specifically mention that string concatenation is being used unsafely.”
Then you give that criterion and your agent’s actual output to a “judge” model, and it scores the response from 0 to 10 with reasoning about why.
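Here’s a minimal sketch of what that judge call can look like. It assumes an OpenAI-compatible chat completions endpoint behind LLM_BASE_URL and LLM_API_KEY; the function name, prompt wording, and response schema are illustrative rather than the exact ones I use.

```typescript
// judge.ts: minimal LLM-as-judge helper (sketch, assumes an OpenAI-compatible endpoint)
export interface JudgeVerdict {
  score: number;          // 0-10
  reasoning: string;      // why the output earned that score
  improvements: string[]; // concrete suggestions for the prompt under test
}

export async function judgeOutput(
  criteria: string,    // natural-language assertion, e.g. "should identify the SQL injection"
  agentOutput: string, // the raw output produced by the agent under test
  judgeModel = "qwen-turbo",
): Promise<JudgeVerdict> {
  const res = await fetch(`${process.env.LLM_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLM_API_KEY}`,
    },
    body: JSON.stringify({
      model: judgeModel,
      temperature: 0,
      messages: [
        {
          role: "system",
          content:
            "You are a strict test judge. Score the candidate output against the criteria from 0 to 10. " +
            'Reply with JSON only: {"score": number, "reasoning": string, "improvements": string[]}',
        },
        { role: "user", content: `Criteria:\n${criteria}\n\nCandidate output:\n${agentOutput}` },
      ],
    }),
  });

  const data = await res.json();
  // Assumes the judge complies with the JSON-only instruction; add repair/retry logic in real use.
  return JSON.parse(data.choices[0].message.content) as JudgeVerdict;
}
```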
The first time I tried this, I was pleasantly surprised by how well it works. Instead of “Test failed: expected ‘findings’ to have length > 0”, I got:
Score: 4/10. The analysis correctly identifies the file as security-relevant, but completely misses the SQL injection vulnerability. The focus on input validation is correct but insufficient. Improvements: Specifically flag the string concatenation in the query construction.
That’s… actually useful? It told me why my prompt wasn’t working and what to fix.
As a bonus, I’ve found great insights by tacking a Q&A step onto the end of LLM calls, which lets me actually question the model about the details of its output.
The False Positive Problem (Or: What You Don’t Find Matters More)
When I started writing these LLM tests, I naturally focused on detection: “Can it find SQL injection?” “Can it spot command injection?”
But then I deployed to a real codebase and got absolutely roasted by false positives.
This is when I learned that testing what your LLM shouldn’t find is actually harder—and more important—than testing what it should find.
So now I write explicit false positive tests:
“This code uses parameterized queries with TypeORM. The analysis should recognize this is SAFE and not flag it as SQL injection.”
And I give these tests a higher passing threshold (8/10 instead of 7/10), because false positives are expensive.
I test safe patterns religiously: password hashes that aren’t secrets, DOMPurify-sanitized HTML that isn’t XSS, environment variables that aren’t hardcoded credentials. All the things that look suspicious to an LLM but are actually fine.
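In test form, that looks something like the sketch below, reusing the judgeOutput helper from earlier; runSecurityAgent is again a hypothetical stand-in for the agent under test.

```typescript
// falsePositives.test.ts: explicitly assert what the agent should NOT find (sketch)
import { expect, test } from "vitest";
import { judgeOutput } from "./judge";
import { runSecurityAgent } from "./agents"; // hypothetical agent under test

const SAFE_TYPEORM_SNIPPET = `
  // Parameterized by TypeORM, no string concatenation into SQL.
  const user = await userRepository.findOne({ where: { email } });
`;

test("does not flag parameterized TypeORM queries as SQL injection", async () => {
  const analysis = await runSecurityAgent(SAFE_TYPEORM_SNIPPET);

  const verdict = await judgeOutput(
    "This code uses parameterized queries via TypeORM. The analysis should recognise it as SAFE " +
      "and must NOT report a SQL injection finding.",
    analysis,
  );

  // Higher bar for false-positive tests: 8/10 instead of the usual 7/10.
  if (verdict.score < 8) console.error(verdict.reasoning);
  expect(verdict.score).toBeGreaterThanOrEqual(8);
});
```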
The Test Pyramid, Adapted
The classic test pyramid still applies, but the layers mean different things:
Smoke tests check that your LLM-judge infrastructure even works. Can it score things? Does it fail when it should? You’d be surprised how easy it is to accidentally write an LLM assertion that always passes. Actually, you probably wouldn’t be surprised. Resigned, maybe.
Unit tests validate individual agents. Does your security agent catch SQL injection? Does it handle Python and Go, not just TypeScript? Does it work on 500-line files, or does it lose focus after 100 lines?
Integration tests check that multiple agents don’t contradict each other. If your security agent says “this password handling is safe” and your pattern agent says “detected hardcoded password”, something’s wrong.
Edge cases are where you test the weird stuff real codebases throw at you: 25-file commits, comment-only changes, deeply nested code, whitespace-only diffs. Turns out LLMs have opinions about whitespace changes.
End-to-end tests validate your entire pipeline. In my case, that’s the full wiki generation flow.
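The smoke layer is the easiest one to get wrong quietly, so the first thing I check is that the judge itself can fail. A sketch, using the same hypothetical judgeOutput helper:

```typescript
// judge.smoke.test.ts: sanity-check the judge before trusting any other test (sketch)
import { expect, test } from "vitest";
import { judgeOutput } from "./judge";

test("judge scores obviously wrong output low", async () => {
  const verdict = await judgeOutput(
    "The analysis should identify the SQL injection vulnerability.",
    "Everything looks great, no issues found.", // deliberately misses the vulnerability
  );
  expect(verdict.score).toBeLessThanOrEqual(3); // a judge that passes this is broken
});

test("judge scores output that meets the criteria high", async () => {
  const verdict = await judgeOutput(
    "The analysis should identify the SQL injection vulnerability.",
    "SQL injection: the query is built by concatenating user input directly into the SQL string.",
  );
  expect(verdict.score).toBeGreaterThanOrEqual(7);
});
```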
Making It Practical
Calling an LLM for every test assertion sounds expensive.
It is! Sort of. The trick is to use a cheap, fast model as your judge (I use Qwen Turbo, which costs pence, or LM Studio if I want to set my laptop on fire) and save your expensive production models for the actual agents you’re testing. The judge’s job is to evaluate output against criteria, so it doesn’t need to be your smartest model.
I also log every test run with the scores, reasoning, and suggestions. This means I can track quality over time, catch regressions when I switch models, and identify which tests consistently score low (usually means my prompt needs work, not the test).
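The logging doesn’t need to be clever. I append one JSON line per judged run so I can diff scores over time; the field names here are illustrative:

```typescript
// evalLog.ts: append one JSON line per judged test run for trend tracking (sketch)
import { appendFileSync } from "node:fs";

export interface EvalRecord {
  test: string;           // test name
  agentModel: string;     // model under test
  judgeModel: string;
  score: number;          // 0-10 from the judge
  reasoning: string;
  improvements: string[];
  timestamp: string;      // ISO 8601
}

export function logEval(record: EvalRecord, path = "eval-log.jsonl"): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}
```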
And crucially: combine deterministic assertions with LLM evaluation. Check that you got a response, that it has the right shape, that confidence scores are reasonable. Then let the LLM judge verify the semantic correctness. Layered testing isn’t just for traditional code.
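Put together, a single test often ends up layered like this: cheap deterministic checks on shape first, then the judge for semantics. Again a sketch built on the hypothetical helpers above, and the findings/confidence fields are an assumed output shape:

```typescript
// layered.test.ts: deterministic shape checks first, semantic judging second (sketch)
import { expect, test } from "vitest";
import { judgeOutput } from "./judge";
import { runSecurityAgent } from "./agents"; // hypothetical agent under test

const VULNERABLE_SNIPPET = `
  const query = "SELECT * FROM users WHERE name = '" + req.query.name + "'";
`;

test("flags string-concatenated SQL, with a sane response shape", async () => {
  const analysis = await runSecurityAgent(VULNERABLE_SNIPPET);

  // Layer 1: deterministic. Did we get parseable JSON with the shape we expect?
  const parsed = JSON.parse(analysis);
  expect(Array.isArray(parsed.findings)).toBe(true);
  for (const finding of parsed.findings) {
    expect(finding.confidence).toBeGreaterThanOrEqual(0);
    expect(finding.confidence).toBeLessThanOrEqual(1);
  }

  // Layer 2: semantic. Does the content actually say the right thing?
  const verdict = await judgeOutput(
    "The analysis should identify the SQL injection and specifically mention that string " +
      "concatenation is used to build the query.",
    analysis,
  );
  if (verdict.score < 7) console.error(verdict.reasoning);
  expect(verdict.score).toBeGreaterThanOrEqual(7);
});
```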
What This Actually Buys You
After a few months of testing this way, here’s what’s changed:
I catch prompt regressions immediately. When I tweak a prompt or switch models, I know within minutes if something broke. Not days later when users complain.
I can refactor prompts confidently. Just like unit tests let you refactor code without fear, these tests let me experiment with prompt structures. If the tests stay green, the behavior is preserved.
I understand my failure modes. The score-and-reasoning format tells me exactly where my prompts are weak. “Scores 6/10 on large files but 9/10 on small files” is actionable information.
False positives are rare. Because I test for them explicitly, with high thresholds, and with realistic examples of safe code.
I can onboard new models faster. Want to try Claude instead of GPT-4? Run the test suite, see how it scores, make informed decisions.
The Bigger Picture
This isn’t really about testing. It’s about bringing the same confidence to LLM systems that we have with traditional code.
Test-Driven Development works because you write a test, watch it fail, make it pass, refactor. The tests give you permission to change things. They catch regressions. They document behavior.
Prompt-Driven Development (if we’re calling it that, sounds like a rubbish name frankly) needs the same cycle. Write an assertion, watch your prompt fail, make it pass, refactor. But you can’t do that with mocks. You need real LLM calls evaluated by real criteria.
If you’re building anything serious with LLMs (agents, RAG systems, code analysis, whatever) and you’re not testing your prompts with actual LLM evaluations, you’re flying blind. Your unit tests will be green. Your prompts will be broken. And you won’t know until production (enjoy debugging!).
Start small: pick your highest-risk use case, write one false positive test, and see what happens. I bet you’ll find something interesting.
If you want to try this approach, the key pieces are: (1) an LLM evaluation function that scores outputs with reasoning, (2) explicit false positive test cases, and (3) logging for quality tracking. The rest is just good testing practice, adapted for a world where your primary logic lives in natural language instead of code.