May 2026

100 Tiny Users

A local product-eval lab that runs synthetic browser users through real app workflows, records where the experience breaks, clusters the pain, and turns the evidence into repair prompts for coding agents.

Agentic Testing, Browser Use, Developer Tooling, Eval Loops


This project came from a very specific irritation:

I hate writing tests.

Not because I think testing is useless. The opposite, actually. I hate the gap between the confidence tests pretend to give you and the confidence your users actually deserve.

LLMs and agents do not magically fix that either. They are very good at producing tests that pass, fail, and look respectable in a repo. But a lot of those tests still end up checking weirdly specific implementation behavior instead of the thing that matters most: did the actual user experience get better, or did the product still break the moment a real person touched it?

People vibe code projects all the time now. They ship with tests. They feel safe. Then users show up, do normal human things, and the product falls apart.

That sucks for everyone involved.

So I built 100 Tiny Users around a simple idea:

create fake users and make them suffer, so real users do not have to.

The Concept

The project uses synthetic users to test a product through the browser UI instead of poking the app programmatically from the side.

Agents have become genuinely good at browser use and computer use. So instead of asking them to write another brittle unit test, 100 Tiny Users turns that strength toward the actual product surface. Each user has a different personality, intent, patience level, device shape, accessibility need, and risk profile. Some are normal. Some are impatient. Some use screen readers. Some paste way too much text. Some are malicious because the internet is what it is and unfortunately so are some of the people on it.
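As a rough illustration of what one of these users might look like as data, here is a minimal TypeScript sketch. The field names (`patienceMs`, `viewport`, `assistiveTech`, `risk`), the sample users, and the `shouldAbandon` helper are all hypothetical; they mirror the traits listed above and are not taken from the project's code.

```typescript
// Hypothetical persona shape; field names are illustrative, not the project's schema.
type RiskProfile = "benign" | "careless" | "malicious";
type AssistiveTech = "none" | "screen-reader" | "keyboard-only";

interface TinyUser {
  id: string;
  personality: string; // e.g. "impatient", "methodical"
  intent: string; // what the user is trying to get done
  patienceMs: number; // longest wait tolerated on a single step
  viewport: { width: number; height: number }; // device shape
  assistiveTech: AssistiveTech;
  risk: RiskProfile;
}

// Two made-up users for illustration.
const impatientMobileUser: TinyUser = {
  id: "u-007",
  personality: "impatient",
  intent: "submit a project before the deadline",
  patienceMs: 3000,
  viewport: { width: 390, height: 844 },
  assistiveTech: "none",
  risk: "benign",
};

const screenReaderUser: TinyUser = {
  id: "u-021",
  personality: "methodical",
  intent: "submit a project using only a screen reader",
  patienceMs: 20000,
  viewport: { width: 1280, height: 800 },
  assistiveTech: "screen-reader",
  risk: "benign",
};

// A user gives up on a step once it takes longer than their patience allows.
function shouldAbandon(user: TinyUser, stepDurationMs: number): boolean {
  return stepDurationMs > user.patienceMs;
}
```

Under these assumptions, a step that takes five seconds loses the impatient mobile user but not the screen-reader user, which is the kind of difference a single scripted test would never surface.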

They interact with the product like people would.

If something breaks, the system records it. Screenshots, traces, console logs, network logs, DOM snapshots, accessibility snapshots, structured findings, replay commands, the whole thing. Then the failures get clustered into something a human or coding agent can actually act on.
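To give a feel for the clustering step, here is a deliberately simple sketch that groups findings by an exact key of workflow, step, and failure kind. The `Finding` and `Cluster` shapes, the sample data, and the artifact paths are assumptions made for this example, and the project's real clustering may work quite differently (for instance, semantically rather than by exact match).

```typescript
// Hypothetical finding shape; not the project's actual schema.
interface Finding {
  userId: string;
  workflow: string;
  step: string;
  kind: "blocked" | "a11y" | "error" | "security";
  artifacts: string[]; // made-up paths to screenshots, traces, logs
}

interface Cluster {
  key: string;
  findings: Finding[];
}

// Group findings that share workflow, step, and kind; biggest clusters first.
function clusterFindings(findings: Finding[]): Cluster[] {
  const index: Record<string, Cluster> = {};
  const clusters: Cluster[] = [];
  for (const f of findings) {
    const key = `${f.workflow}/${f.step}/${f.kind}`;
    let cluster = index[key];
    if (!cluster) {
      cluster = { key, findings: [] };
      index[key] = cluster;
      clusters.push(cluster);
    }
    cluster.findings.push(f);
  }
  return clusters.sort((a, b) => b.findings.length - a.findings.length);
}

// Made-up findings for illustration.
const sampleFindings: Finding[] = [
  { userId: "u-021", workflow: "queue-search", step: "filter", kind: "a11y", artifacts: ["runs/1/u-021/a11y.json"] },
  { userId: "u-007", workflow: "submit-project", step: "upload", kind: "blocked", artifacts: ["runs/1/u-007/trace.zip"] },
  { userId: "u-013", workflow: "submit-project", step: "upload", kind: "blocked", artifacts: ["runs/1/u-013/trace.zip"] },
];

const sampleClusters = clusterFindings(sampleFindings);
```

Sorting by cluster size is one possible way to put the pain that hits the most users at the top of the list.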

So the end of a run looks less like a pass/fail count and more like this:

this kind of user got hurt
this is what they were trying to do
this is what the product promised
this is what actually happened
here is the evidence
here is a repair-ready prompt for Codex or Cursor
rerun the same users after the fix and see if the experience improved

Which is what I care about the most.
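The run summary described above could be rendered into a repair prompt roughly as follows. This is only a sketch: the `ClusterSummary` shape, the prompt wording, the sample values, and the replay command string are hypothetical, and the project's actual repair packet format is not shown here.

```typescript
// Hypothetical summary of one cluster of failures.
interface ClusterSummary {
  persona: string; // which kind of user got hurt
  goal: string; // what they were trying to do
  promised: string; // what the product promised
  observed: string; // what actually happened
  evidence: string[]; // made-up artifact paths
  replayCommand: string; // made-up command; the real CLI is not documented here
}

// Render the summary as plain text a coding agent can act on.
function buildRepairPrompt(c: ClusterSummary): string {
  const evidenceLines = c.evidence.map((path) => `- ${path}`).join("\n");
  return [
    `Affected user: ${c.persona}`,
    `Goal: ${c.goal}`,
    `Expected: ${c.promised}`,
    `Observed: ${c.observed}`,
    "Evidence:",
    evidenceLines,
    `Replay: ${c.replayCommand}`,
    "Fix the product behavior, then rerun the replay command to confirm.",
  ].join("\n");
}

const repairPrompt = buildRepairPrompt({
  persona: "screen-reader user",
  goal: "submit a hackathon project",
  promised: "the form announces a confirmation after submit",
  observed: "no status message is exposed to assistive technology",
  evidence: ["runs/1/u-021/a11y.json", "runs/1/u-021/screenshot.png"],
  replayCommand: "<replay command recorded for user u-021>",
});
```

Keeping the promise and the observed behavior on adjacent lines is a design choice in this sketch: it makes the gap the agent has to close hard to miss.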

What It Runs Today

The current prototype is a local Next.js eval lab with two product surfaces built into it:

Portal, a public hackathon submission flow
Workbench, an internal customer-operations queue

Portal started as the first demo surface because it was simple enough to explain quickly: submit a project, reject duplicate teams, stay accessible, survive long pasted text, and not execute malicious input.

Workbench is more comprehensive. It simulates a production-ish internal operations tool with queue search, ownership changes, admin identity confirmation, billing credits, follow-up state, and the kind of boring workflow details that usually expose whether a test harness is actually useful or just performative slop.

The repo now has:

config-driven target and workflow definitions
deterministic Playwright browser execution
a semantic mini-user harness
an external webhook harness for outside executors
persona-based replay across different user archetypes
artifact capture for screenshots, traces, console logs, network logs, DOM, and accessibility state
SQLite-backed run metadata
dashboard visibility for the latest run
repair packet generation for Codex and Cursor
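For a sense of what a config-driven target and workflow definition can look like, here is a small sketch. Every name in it is an assumption made for illustration: the `TargetConfig` shape, the step actions, the selectors, the paths, and the `http://localhost:3000` address are not taken from the repo.

```typescript
// Hypothetical config shape; the project's actual config format is not shown in this post.
type StepConfig =
  | { action: "goto"; path: string }
  | { action: "fill"; selector: string; value: string }
  | { action: "click"; selector: string }
  | { action: "expect"; selector: string; text: string };

interface WorkflowConfig {
  name: string;
  steps: StepConfig[];
}

interface TargetConfig {
  name: string;
  baseUrl: string; // assumed local dev server address
  workflows: WorkflowConfig[];
}

const portalTarget: TargetConfig = {
  name: "portal",
  baseUrl: "http://localhost:3000",
  workflows: [
    {
      name: "submit-project",
      steps: [
        { action: "goto", path: "/portal" },
        { action: "fill", selector: "#team-name", value: "Tiny Team" },
        { action: "click", selector: "button[type=submit]" },
        { action: "expect", selector: "[role=status]", text: "Submission received" },
      ],
    },
  ],
};

// A workflow with no "expect" step has nothing to check, so flag it early.
function workflowsMissingExpectation(target: TargetConfig): string[] {
  return target.workflows
    .filter((w) => !w.steps.some((s) => s.action === "expect"))
    .map((w) => w.name);
}
```

Describing workflows as data rather than as test code is what would let the same steps be replayed by different personas or handed to an outside executor.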

In a nutshell, this is a small local product-evaluation loop where you can run fake users, collect where the experience broke, generate the repair context, patch the product, and run the same cohort again.
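The "run the same cohort again" step boils down to comparing two runs user by user. A minimal sketch of that comparison, assuming a hypothetical `UserOutcome` record per synthetic user (the project's real run metadata lives in SQLite and its schema is not shown here):

```typescript
// Hypothetical per-user outcome; one entry per synthetic user in a run.
interface UserOutcome {
  userId: string;
  completed: boolean; // did the user reach the end of their workflow?
}

interface RunDiff {
  fixed: string[]; // failed before, completes now
  regressed: string[]; // completed before, fails now
  stillBroken: string[]; // failed before and still fails
}

function compareRuns(before: UserOutcome[], after: UserOutcome[]): RunDiff {
  const afterById = new Map<string, boolean>();
  for (const o of after) afterById.set(o.userId, o.completed);

  const diff: RunDiff = { fixed: [], regressed: [], stillBroken: [] };
  for (const o of before) {
    const nowCompleted = afterById.get(o.userId);
    if (nowCompleted === undefined) continue; // user missing from the rerun; skip
    if (!o.completed && nowCompleted) diff.fixed.push(o.userId);
    else if (o.completed && !nowCompleted) diff.regressed.push(o.userId);
    else if (!o.completed && !nowCompleted) diff.stillBroken.push(o.userId);
  }
  return diff;
}

// Made-up outcomes for illustration.
const baselineRun: UserOutcome[] = [
  { userId: "u-001", completed: false },
  { userId: "u-002", completed: true },
  { userId: "u-003", completed: false },
  { userId: "u-004", completed: true },
];
const rerunAfterFix: UserOutcome[] = [
  { userId: "u-001", completed: true },
  { userId: "u-002", completed: false },
  { userId: "u-003", completed: false },
  { userId: "u-004", completed: true },
];

const runDiff = compareRuns(baselineRun, rerunAfterFix);
```

Tracking regressions separately matters in this sketch because a patch that rescues one persona can quietly break another.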

Why This Mattered To Me

I built the first prototype at the Codex Emergency Hackathon, which was part of the AI Engineer Conference Singapore. That version targeted a fake hackathon submission portal and proved the basic before-and-after loop.

Then I got to showcase it at OpenAI's GPT 5.5 Demo Day, alongside a couple of other things I had built (yes, deptrace and friday-for-codex).

[Image: Demo Day picture]

This is when the project actually felt a little bigger than just me building something for my own use.

I say that because after my presentation, people came up to me in person, and someone even reached out on LinkedIn the next day to talk about how useful the concept was. The conversations were not just polite demo-day small talk either. They got into agent limitations, sandboxing, access boundaries, product workflows, and where this kind of testing could fit into their own teams.

That felt really good. Something I built purely for myself suddenly had other people saying 'wait, I need this too.'

The first demo was just that, a demo. The more complete prototype came together at the AI Engineer Hackathon the following week, where I pushed it from a fake submission portal into the more production-shaped Workbench surface. I missed the demo upload deadline by a couple of minutes, which is deeply annoying and also very much on me lol.

But the project itself held up.

The Part I Like Most

The thing I like about 100 Tiny Users is that it treats tests as user evidence rather than as "code coverage"-style proof.

A passing test suite can still leave you with a broken product. A browser user getting stuck is much harder to dismiss, especially when you have the trace, screenshot, DOM, accessibility state, and replay command sitting right there.

This is also the kind of developer tool I actually enjoy building. Not stupid SaaS for the sake of having a SaaS-shaped thing, but something that exists because I felt the pain of its absence directly and could see exactly how many other builders feel it too.

That actually made me realize something.

I am good at building tools for developers, and I like doing it, because I am one. I know what parts of the workflow feel fake, what parts feel annoying, and what kinds of automation actually reduce pain instead of adding another thing to babysit.

There is still a lot I want to scale here: real target onboarding, stronger sandboxing, richer agentic users, better root-cause clustering, hosted runs, and cleaner repair orchestration. But even in its current prototype form, the core idea just feels right.

Build the product. Send in the tiny users. Watch where they suffer. Fix the thing. Send them in again. See your product actually hold up.

That is what should give you confidence, not a "67 tests passed in 23.4s" metric from some random agent you sent off to run tests in your codebase.
