The Mech Suit Methodology has a line about the economics of multi-agent orchestration: "one expensive fifteen-minute architect session produces a structured plan that can drive hours of cheap builder work." That line is doing a lot of work. The hidden premise is that the architect session has to be rigorous enough to catch the expensive mistakes, so that the builder hours that follow do not produce a doomed feature. Stress-testing is the discipline that makes that true. Without it, the expensive session just produces expensive mistakes faster.
This post is about the time the stress-test worked. A progress bar was designed, interrogated against how the planner actually behaves, and killed at the refine stage. The builder never saw it. No code was written. The design time was not zero — I did spend roughly forty minutes drawing the thing and reasoning about its states — but forty minutes of architect time is a rounding error compared to the four or five hours of builder time that would have been spent implementing and later reverting it. That is the Mech Suit principle in its cleanest form: expensive refine, cheap execute, and the refine step has to be rigorous enough to find the bodies before they ship.
What I was trying to solve
Precheck runs planning sessions in parallel. Up to five concurrent runs. Each run has its own card in the Run Queue panel — a small tile that shows what the run is doing, roughly how far along it is, and whether it looks healthy. The cards update in real time as server-sent events arrive from the engine.
The design problem was: how do you communicate "roughly how far along" for a non-linear process? The planner doesn't march through stages in order. It branches. It retries. It decomposes features into smaller features, then decomposes those. A run might hit the ASSESS stage three times on the same feature, then jump back to REFINE, then decompose into a new subtree that starts over from SELECT. The trajectory is a tree, not a line.
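To make the tree shape concrete, here is a minimal sketch of what a run trajectory looks like as data. All names here (`FeatureNode`, the stage strings, the feature names) are invented for illustration; the real engine's types will differ.

```python
from dataclasses import dataclass, field

# Hypothetical model of a planner trajectory. A node's stage_history can
# revisit the same stage (retries) and jump backward (re-refinement);
# children are the sub-features produced by decomposition, each of which
# starts over from SELECT.
@dataclass
class FeatureNode:
    name: str
    stage_history: list
    children: list = field(default_factory=list)

root = FeatureNode(
    name="feature-A",
    stage_history=["SELECT", "REFINE", "ASSESS", "REFINE", "ASSESS"],
    children=[
        # Decomposition subtree: each sub-feature restarts from SELECT.
        FeatureNode("feature-A.1", ["SELECT", "REFINE"]),
        FeatureNode("feature-A.2", ["SELECT"]),
    ],
)
```

Any honest visualization has to be a projection of this tree, not of a single position along a line.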
My first instinct was the obvious one. Pick the major stages — SELECT, REFINE, ASSESS, then two more for "resolving" and "finalizing" — and render them as a horizontal progress bar. Five boxes. Fill them in as the run advances. Simple. Familiar. Every operator knows how to read it.
It lasted forty minutes.
What stress-testing actually looks like
The stress-test is not a formal process. It is a conversation with the design in which you try to break it. You pick a specific scenario — not an average scenario, a specific edge — and you walk through what the design would show. Then you ask: does that communicate something true? Does the operator looking at this understand what is happening? Does the design imply something that is not real?
Scenario one for the progress bar: a run hits ASSESS, fails, retries, fails again, retries a third time, and finally accepts. According to the five-stage bar, the run was at stage 3 the whole time. The bar does not move. The operator watching it sees "Stage 3 of 5" for four minutes and concludes the run is stuck. It is not stuck. It is doing exactly what the planner is supposed to do — exercising the assess-and-retry loop that is the whole point of constraint-aware planning. The bar is lying about the thing the operator most needs to understand.
Scenario two: a run decomposes. A feature gets split into three sub-features, and the planner starts processing the first sub-feature from stage 1 again. Where does the bar go? Does it jump backward? Does it stay at the parent's stage? Does it multiply into three parallel bars? None of the answers are good. The design assumed one run, one linear path, and the planner does not work that way.
Scenario three: a run hits STOP. The run did not complete. It did not reach stage 5. But it is also not running anymore — the stop decision is terminal. The bar has to show three stages filled, two empty, and somehow communicate "this is the final state" without implying "this run has two more stages to go." There is no visual language for that. Every cue the bar gives the operator is wrong.
Forty minutes in, I had a design that lied in three common scenarios and was ambiguous in a fourth. The five-stage progress bar was plausible. It was wrong.
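The failure mode can be made mechanical. Any mapping from "current stage" to a bar index throws away exactly the information the operator needs; a toy sketch (stage names taken from the post, the mapping itself invented for illustration):

```python
STAGES = ["SELECT", "REFINE", "ASSESS", "RESOLVE", "FINALIZE"]

def bar_index(current_stage: str) -> int:
    # Naive linear mapping: current stage -> "Stage N of 5".
    return STAGES.index(current_stage) + 1

# Scenario one: three ASSESS retries in a row. The run is making real
# progress (exercising the retry loop), but the bar never moves.
retry_loop = [bar_index(s) for s in ["ASSESS", "ASSESS", "ASSESS"]]
print(retry_loop)  # [3, 3, 3] -- reads as "stuck"

# Scenario two: decomposition restarts a sub-feature from SELECT. The
# bar jumps backward, which reads as regression rather than branching.
decompose = [bar_index(s) for s in ["ASSESS", "SELECT", "REFINE"]]
print(decompose)  # [3, 1, 2]
```

The function is not buggy; it is faithfully reporting a quantity that does not mean what the bar implies it means.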
A linear visualization of a non-linear process is a design anti-pattern that feels right because it pattern-matches to familiar UX. The familiarity is the trap. If your underlying model is a tree with retries and decomposition, do not start with a progress bar.
The artifact that killed it
After the progress bar was killed, the agent writing the learning subsystem retrospective caught the event and recorded it as one of the retrospective's numbered entries.
I want you to notice a few things about that receipt. First, it is terse. Four sentences. It does not belabor the point. Second, it is honest about the cost — "the design time was wasted." Not "saved by the stress test." The stress test saved the builder time, but the forty minutes of architect time is gone regardless. Third, it states the lesson as a rule: if the underlying model is non-linear, do not start with a linear visualization. Future Derek, or a future agent, or a future contributor reading this retrospective knows to check the underlying model before reaching for the familiar UX pattern.
This is what a good retrospective entry looks like. Specific event. Honest cost accounting. Transferable lesson. No victory-lap language. The retrospective itself was written by the agent at the end of the session, while the context was still hot — which is why the lesson is sharper than anything I would have written three days later when the memory of the dead mockup had blurred.
What replaced it
The design that shipped is completely different. Instead of a progress bar, each run card shows a stage-aware status line: the current stage name; a fraction counter of locked work (features that have reached a terminal state) over total work; a canonical activity label describing what the planner is doing right now; a health dot (green/yellow/red, driven by stop pressure and retry concentration); and, in an expandable section, a feature table showing per-feature state. The cards update in real time as server-sent events arrive from the engine.
None of it is a progress bar. None of it pretends the process is linear. The locked-fraction counter tells you how much work has reached a terminal state, which is a real number that only ever goes up. The activity label tells you what the planner is doing right now, which is always current. The health dot tells you whether the run is in a concerning state independent of how far along it is. Together, these communicate "roughly how far along and whether it looks healthy" without implying a linear trajectory.
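A rough sketch of that projection, to show why each field stays honest. The function signature, thresholds, and state names are all assumptions for illustration; the real card is computed from the engine's actual runtime artifacts.

```python
# Hypothetical projection of run state into the card's display fields.
def project_card(features: list, current_activity: str,
                 stop_pressure: float, retry_concentration: float) -> dict:
    # Locked fraction: features in a terminal state over total features.
    # Terminal states never un-terminate, so this number only goes up.
    locked = sum(1 for f in features if f["state"] in ("ACCEPTED", "STOPPED"))
    # Health dot: derived from risk signals, independent of how far
    # along the run is. Thresholds here are invented placeholders.
    if stop_pressure > 0.7 or retry_concentration > 0.7:
        health = "red"
    elif stop_pressure > 0.3 or retry_concentration > 0.3:
        health = "yellow"
    else:
        health = "green"
    return {
        "locked": f"{locked}/{len(features)}",
        "activity": current_activity,  # always-current, never a forecast
        "health": health,
    }

card = project_card(
    features=[{"state": "ACCEPTED"}, {"state": "ASSESSING"}, {"state": "STOPPED"}],
    current_activity="assessing feature-B",
    stop_pressure=0.2,
    retry_concentration=0.5,
)
print(card)  # locked '2/3', health 'yellow'
```

Note that a STOPPED feature counts as locked: early termination is a terminal state like any other, which is exactly the ambiguity the progress bar could not express.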
The receipt for the replacement design lives in the same retrospective.
The design that shipped is a pure function of the runtime artifacts that already exist. No new backend state. No synthetic stage mapping. No invented progress semantics. The card is a projection of the tree, and the tree is what it is — non-linear, branching, sometimes terminating early. The card communicates that honestly, and the operator's mental model stays aligned with the actual planner behavior.
The part that took me a while to understand
Stress-testing is not free. The forty minutes I spent on the progress bar were real minutes. In a world where I had just started building it instead of interrogating it, I would have had a working-looking feature faster. The stress-test is an investment that feels slow right up until the moment it pays off, and the payoff is never visible — you get credit for the work you did not have to revert, which is an invisible kind of win.
The trap is that people who skip stress-testing get to "working" faster, and "working" looks like progress. The feedback loop punishes you in the short term for being rigorous. You have to make peace with the idea that the architect session is supposed to feel slow. If the architect session feels fast, it is probably missing the edges. If it feels slow because you are walking through scenarios and breaking your own designs, it is doing its job.
The corollary is that stress-testing has to be cheap enough to actually happen. The reason I stress-tested the progress bar for forty minutes was that the cost of the test was just my time and some rough mockups. There was no framework to set up, no stakeholder to schedule, no review board to convene. Just me, the design, and three specific scenarios I could walk through in my head. If stress-testing had been a formal process with overhead, I would have skipped it. The discipline works because it is lightweight enough to be the default.
The stress-test has to be cheaper than the cost of the mistake it catches. A forty-minute mental walkthrough that saves five hours of builder work is the ratio you want. If your stress-testing process takes longer than the thing it's testing, you will skip it, and the whole discipline collapses.
The other thing that died
The same retrospective has a second kill I want to mention briefly, because it illustrates a different flavor of the same discipline.
A different feature, a different stress-test, a different kill. A proposed UI restructure would have split the operator's view across seven tabs organized by persona. The stress-test was: walk through the operator's actual task flow and ask whether the tab boundaries match the task boundaries. They didn't. The operator would be hopping between tabs for every real task, which is the same problem the existing UI had, rearranged. And the deeper discovery was that the backend was already producing a consolidated markdown report (learning_diff_report) that was strictly better than any multi-tab view — the right answer was to surface the report, not fragment it.
Two kills in the same retrospective. Two designs that looked reasonable on paper, died in stress-testing, and never cost a builder session. The pattern is the same: specific scenarios walked against the design, the design failed a scenario, the design was killed, the lesson was written down. That is the refine step doing its job.
What I'm still figuring out
The limitation I keep bumping into is that stress-testing is mostly me. It is a skill I am still developing, and when I am tired or rushed I do not do it well. The scenarios I pick are not always the right scenarios. I miss edges that a more experienced designer would catch.
The thing I have started experimenting with is explicitly asking the architect-session agent to generate stress-test scenarios for me before I commit to a design. "Here is the design. Walk me through three scenarios where this fails. Include at least one terminal-state scenario and one retry-loop scenario." The agent is not perfect at this — it sometimes generates scenarios that the design handles fine and misses the ones that break it — but it has caught things I missed, and the cost is low enough that it is worth running as a default.
The pattern I am circling toward is: the human picks the design, the agent generates adversarial scenarios, the human walks the design through those scenarios. That seems to be the right division of labor. The agent is good at enumerating edges. The human is good at judging whether the design's behavior in an edge is actually acceptable. Together they stress-test better than either alone.
Back to the arc
The parent article frames Phase 4 as "three agents, each doing one thing" and identifies the expensive refine / cheap execute split as the core leverage. This post is the evidence that the refine step is doing something real. A progress bar died in the refine step. A tab restructure died in the refine step. Neither of them ever cost a builder session. That is the leverage showing up as specific dead features in a retrospective.
The next post goes further into the same discipline from a different angle: what happens when you use the refine step to diagnose why an existing feature is not working the way you intended. The stress-test catches bad designs before build. Root-cause analysis catches bad implementations after build. Same muscle, different use.