Field Notes · The Precheck Cut
06 / 08

Determinism as a Design Requirement

Most planning tools treat replay as a nice-to-have. Precheck inverts it: replayability is a constraint on every other decision. Here is what that costs, what it enables, and why it is the reason the repo can tell you about itself.

Six questions. Printed at the top of the Precheck system foundation. They are not a wishlist. They are the test for whether the system is allowed to exist.

If an archived run cannot answer all six from stored data alone, without reading logs and without reconstructing intent, the foundation is incomplete. Not suboptimal. Incomplete. The build is not done.

Every planning system I have seen treats replay as a feature. A thing you add later if you have time. A nice-to-have that ranks somewhere between "export to CSV" and "dark mode" on the priority list. In Precheck, replay is the opposite of a feature. It is the constraint that every other decision has to survive. The difference sounds subtle when I write it like that, but it reshapes the architecture at a deep level, and it is the reason the repo has the property I described in post one — the ability to tell you about itself — instead of being just another well-documented codebase. This post is about what that inversion actually costs and what it enables.

The six questions

I quoted this in post one, but I want to quote it again here because it is the foundational constraint that the rest of this post is about. It lives at the top of the Precheck system foundation, in the document named 00-README-FIRST.md, under the heading "Definition of success."

Receipt docs/system-foundation-v2/00-README-FIRST.md
The rebuild is on track when one archived run can answer all of the following without recomputation:
- What was evaluated?
- Which guardrail affected the outcome?
- Why did the run continue, retry, decompose, skip, or stop?
- What suggestion was generated?
- What changed between attempts?
- Did the path earn its depth, or was it stopped for churn?

If the answer requires reading raw provider output or reconstructing intent manually, the foundation is incomplete.

Six questions. Each one is a thing an operator should be able to ask about any archived run and get a definitive answer from the archive alone. Not "go look at the logs." Not "re-run the scenario and see." Not "ask Derek." Open the archive, read the answer, done.

The phrase that matters most in that list is "without recomputation." Every question has to be answerable from the archive as stored, not from anything that has to be re-derived at read time. That constraint sounds mild. It is not mild. It means every answer has to be written into the archive at run time, which means the runtime code has to know in advance what questions will be asked, which means it has to commit to an explicit shape for every answer, which means every stage of the planner has to produce structured records instead of free-form rationale strings.

And "recomputation" has a specific, strict meaning here. Reading a structured field out of an archive is fine. Running a projection function over a list of canonical events is fine — the canonical events are the truth, and the projection is a view. What is not fine is running the planner again, or calling the LLM again, or reconstructing intent from logs. The archive has to be self-contained. Every question has to resolve against the bytes that were stored when the run happened.
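The distinction between recomputation and projection can be made concrete with a sketch. Everything below is hypothetical — the event types, field names, and question functions are illustrative, not Precheck's actual schema — but it shows the allowed pattern: pure views over stored events, never a re-run.

```python
# Hypothetical canonical events as they might be stored in an archive.
# Field names are illustrative, not Precheck's real schema.
events = [
    {"type": "guardrail_outcome", "guardrail": "schema-check", "result": "pass"},
    {"type": "guardrail_outcome", "guardrail": "scope-bound", "result": "fail"},
    {"type": "decision", "action": "retry", "reason": "guardrail:scope-bound"},
    {"type": "decision", "action": "stop", "reason": "churn"},
]

def which_guardrail_affected_outcome(events):
    """Projection: a pure view over stored events, no recomputation.

    Answers one of the six questions by reading the archive as-is.
    """
    return [e["guardrail"] for e in events
            if e["type"] == "guardrail_outcome" and e["result"] != "pass"]

def why_did_the_run_stop(events):
    """Projection: the last decision record explains continuation or stop."""
    decisions = [e for e in events if e["type"] == "decision"]
    return decisions[-1]["reason"] if decisions else None

print(which_guardrail_affected_outcome(events))  # ['scope-bound']
print(why_did_the_run_stop(events))              # 'churn'
```

Neither function calls the planner or the provider; both would return the same answer next year, on a different machine, against the same bytes.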

Why this is the inversion

Most systems have features. The features accumulate, and at some point someone asks "can we replay these runs?" and a replay system gets built on top of whatever the features happened to produce. The replay system has to cope with whatever information is available, which is usually whatever got logged. If the logs are sparse, the replay is sparse. If the logs are verbose but unstructured, the replay works for humans but not for programs. If the logs were rotated, the replay doesn't work at all. Replay is downstream of features, and its quality is limited by what the features chose to emit.

Precheck does it the other way around. The six questions are the requirement. The features have to produce data that answers the six questions at archive time, or the features are not done. A new feature that adds a capability but doesn't extend the archive schema to cover its new behavior is an incomplete feature, even if it "works" in the running system. The archive is not downstream of the features. The features are downstream of the archive.

This is the inversion. It flips the dependency direction. And it has sharp consequences for every other decision.

What the inversion costs

Let me be specific about what gets harder when you commit to this.

First, the persistence boundary gets complicated. You cannot just write "whatever the provider returned" into the archive. The provider's output is unstable — different providers format things differently, the same provider can format things differently across versions, and some fields are free-form text that you cannot parse reliably later. You have to normalize the provider output into a canonical shape before it enters the archive. That normalizer is a real piece of code. It has to handle every provider's quirks, every edge case, every version drift. And when normalization rules change, you have to re-harvest from the raw data, because your normalized store is now out of date.

There is an ADR in the Precheck repo that captures exactly this tradeoff.

Receipt docs/system-foundation-v2/adr/008-normalization-before-persistence.md
## Decision
All guardrail outcomes pass through IGuardrailNormalizer before being written to guardrail_outcomes. Normalization handles:
- Result casing: PASS→pass, FAIL→fail, WARN→partial, PARTIAL→partial
- Category lookup: joins against GuardrailEntry from the intake document
- Evidence signature: stopword removal + number normalization + 120-char truncation + SHA256 hash (12 hex chars)

Raw data stays in archive_json per ADR-002. Learning tables only contain normalized data.

## Consequences
- Cross-run aggregation operates on consistent data
- Evidence signatures enable failure pattern grouping without exact text matching
- If normalization rules change, a re-harvest from archives is needed to update existing rows

That last consequence is the honest one. "If normalization rules change, a re-harvest from archives is needed to update existing rows." You pay for the rules you committed to. If you discover a better normalization, you have to re-run it across every historical archive. That is a real cost. But it is also why you can afford to improve the rules over time — the raw data is still there, in archive_json, so re-harvest is always possible. The cost is "some CPU time and a migration," not "data lost forever."
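To make the evidence-signature rule tangible, here is one plausible reading of ADR-008's recipe. The ADR names the steps (stopword removal, number normalization, 120-char truncation, SHA256 truncated to 12 hex chars) but not the implementation, so the stopword list, the digit-to-token mapping, and the ordering of steps below are all assumptions.

```python
import hashlib
import re

# Result-casing map, straight from ADR-008.
RESULT_CASING = {"PASS": "pass", "FAIL": "fail", "WARN": "partial", "PARTIAL": "partial"}

# Illustrative stopword list; the real one is not shown in the ADR.
STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "in"}

def evidence_signature(text: str) -> str:
    """Stopword removal + number normalization + 120-char truncation + SHA256(12)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumed number normalization: collapse every digit run to a single token,
    # so evidence differing only in concrete numbers groups together.
    kept = [("N" if w.isdigit() else w) for w in words if w not in STOPWORDS]
    normalized = " ".join(kept)[:120]                 # 120-char truncation
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]  # 12 hex chars

# Two failures that differ only in numbers and filler words get one signature,
# which is what enables failure-pattern grouping without exact text matching:
a = evidence_signature("The latency was 450 ms, above the limit")
b = evidence_signature("latency was 90 ms, above limit")
print(a == b)  # True
```

Whatever the real rules are, the shape of the tradeoff is visible here: the signature is only as good as the normalization choices, and changing any of them invalidates every stored signature — hence the re-harvest clause.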

Second, the canonical archive has to be treated as the only source of truth. No shortcuts. No "let me just read the logs this one time." If you start cheating and reading the logs, the logs become load-bearing, and now you have two sources of truth that can drift, and the whole property you were trying to preserve is gone. The Precheck repo has an explicit ADR for this, too.

Receipt docs/system-foundation-v2/adr/002-canonical-archive-as-truth.md
## Decision
Treat the canonical archived run as the authoritative record of what happened, why it happened, and what happened next. Logs, provider payloads, and future observability layers remain supporting evidence only.

## Alternatives considered
- Provider output as truth: rejected because provider payloads are unstable and not planner-owned.
- Trace system as truth: rejected because replay would depend on external infrastructure.

## Consequences
Replay, validation, golden fixtures, and operator tooling all center on the same archive format. Any architectural change that affects semantics must update contracts, fixtures, tests, and docs together.

The last line of consequences is the quiet one: "Any architectural change that affects semantics must update contracts, fixtures, tests, and docs together." That is the ongoing tax. Every change that touches the shape of a run has to ripple through four places. If you skip any of them, the archive becomes inconsistent with something, and the invariant breaks. You commit to the tax or you do not commit to the trust model. There is no middle ground.

Third, deep runs have to "earn their depth." This is the phrase from the system foundation that took me the longest to understand. A deep run — ten to fifteen nodes in the decision tree, multiple retry cycles, decomposition into sub-features — is only allowed if each step records a real state change. If a step is a no-op (the model said the same thing with slightly different words, no new information emerged), the runtime has to detect that and block the step or force a terminal stop. Otherwise a "deep run" is just repeated churn that looks impressive but contains no new signal.

Receipt plan/deterministic-deep-run-plan.md
By the end of this slice, the system must guarantee:
- every non-initial bounce records at least one valid canonical state delta
- retry vs decompose vs stop is deterministic for the active node
- lessons can be traced from creation to applied effect
- decomposition creates a real new decision surface or is rejected
- no-op attempts are blocked and can force STOP
- operator and renderer surfaces can explain why a deep run continued or terminated

Each of those guarantees is a real piece of enforcement code. Every bounce must record a state delta (if it doesn't, reject the bounce). Retry vs decompose vs stop must be deterministic for any given node state (if it isn't, the decision logic has a hole). Decomposition must open a genuinely new decision surface (if it doesn't, it is a relabeled retry, and the planner has to reject it). No-op attempts must be blocked (if they aren't, the planner can get stuck in a loop that looks productive). Every one of these is work. The work only makes sense if the answer to "did this path earn its depth" has to be in the archive. If you don't care about answering that question, you don't do this work, and your deep runs can drift into semantic churn and nobody notices.
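The first and fifth guarantees — a bounce must record a state delta, and no-op attempts must be blocked — can be sketched together. The state shape, function names, and the two-strike threshold below are hypothetical; the point is only that "no delta" is a mechanically checkable condition, not a judgment call.

```python
def state_delta(before: dict, after: dict) -> dict:
    """Return the fields that actually changed between two node states."""
    return {k: (before.get(k), after.get(k))
            for k in set(before) | set(after)
            if before.get(k) != after.get(k)}

def accept_bounce(before: dict, after: dict, noop_streak: int, max_noops: int = 2):
    """Reject no-op bounces; force a terminal STOP after repeated churn."""
    delta = state_delta(before, after)
    if delta:
        return ("accept", delta, 0)        # real state change: reset the streak
    noop_streak += 1
    if noop_streak >= max_noops:
        return ("stop", {}, noop_streak)   # churn: deep run did not earn its depth
    return ("reject", {}, noop_streak)

# A bounce that restates the same plan produces no delta and is rejected,
# and a second no-op in a row forces STOP:
s = {"plan_hash": "abc", "open_questions": 3}
print(accept_bounce(s, dict(s), noop_streak=0))  # ('reject', {}, 1)
print(accept_bounce(s, dict(s), noop_streak=1))  # ('stop', {}, 2)
```

Because the delta itself is returned and archived, "did this path earn its depth" becomes a projection over recorded deltas rather than a reconstruction.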

Fourth, and most subtly, the planner cannot have hidden memory. If the decision logic is allowed to read state that is not in the archive, the archive is no longer self-contained, and replay becomes impossible. This has a specific consequence for learning: the learning system cannot modify decision thresholds or branch-selection logic based on prior runs, because those modifications would not be visible in the current run's archive. Learning has to work by injecting visible context into the prompt — a <learning_from_prior_runs> block that appears in the provider request and is therefore captured in the archive alongside everything else. The next post goes into this in detail, but I want to mention it here because it is a direct consequence of the determinism requirement, not a separate design choice.
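A minimal sketch of that constraint, assuming a simple lesson record shape (the `<learning_from_prior_runs>` tag comes from the post; everything else here is invented for illustration). The key property is that the lessons travel inside the prompt text, so archiving the provider request archives the learning influence too — there is no hidden memory to reconstruct.

```python
def build_learning_block(lessons: list[dict]) -> str:
    """Render prior-run lessons as visible prompt context."""
    if not lessons:
        return ""
    lines = [f"- [{l['id']}] {l['text']}" for l in lessons]
    return ("<learning_from_prior_runs>\n"
            + "\n".join(lines)
            + "\n</learning_from_prior_runs>")

def build_prompt(task: str, lessons: list[dict]) -> str:
    # Learning never mutates thresholds or branch logic; it only adds text
    # that the archive captures alongside everything else the model saw.
    block = build_learning_block(lessons)
    return f"{block}\n\n{task}" if block else task

prompt = build_prompt(
    "Evaluate feature X against the guardrails.",
    [{"id": "LSN-12", "text": "Decomposing auth features early reduced retries."}],
)
print("<learning_from_prior_runs>" in prompt)  # True
```

With no lessons, the prompt is unchanged — which is itself an archivable fact: a run that received no learning context looks exactly like one.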

What the inversion enables

I have spent most of this post on what determinism costs. Now I want to tell you what it buys.

The first thing it buys is the property that the rest of this series is about. The repo can tell you about itself. The reason the repo can tell you about itself is that every archived run contains enough structured data to answer the six questions without recomputation — and therefore every historical run is a self-contained artifact that an agent can read and interpret. Without the determinism requirement, the archives would be sparse, unstructured, or dependent on external state, and an agent reading them cold would have to guess. With the determinism requirement, the archives are complete, structured, and self-contained, and an agent reading them cold gets the truth.

The second thing it buys is explainability without trust. An operator looking at a Precheck run does not have to trust that I remember what the planner did. They can read the archive and see it. They can see which guardrail fired, what evidence the model produced, which suggestion got generated, what changed between retries, and whether the run stopped for churn or for bounds. They do not need to trust the operator, the developer, or the tool. The archive is the truth, and the archive is readable.

This matters disproportionately for AI systems. When a deterministic algorithm produces a result, people are willing to trust it because they understand the rules. When an LLM produces a result, people want to understand why, and "the LLM decided" is not a sufficient answer. The six questions are a proxy for the deeper question "why should I believe this?" — and the archive's structured answers to those six questions are the material out of which trust can be constructed. Without the determinism requirement, you can't answer the six questions, and you can't construct the trust.

The third thing it buys is the ability to improve without breaking. Because the raw data (archive_json) is preserved alongside the normalized data, I can change my normalization rules later and re-harvest. Because the canonical archive is stable, I can change the projection logic that renders views of it without touching the underlying data. Because the planner's decisions are reproducible from the archive, I can replay old runs against new planner code to see if the new code would have decided differently. Every one of those operations depends on the archive being authoritative. They are free in Precheck because the determinism requirement was paid for upfront.
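The replay-against-new-code operation is worth sketching, because it only works if decisions are pure functions of archived state. The archive layout and policy function below are hypothetical; the pattern is the point: feed archived inputs to the candidate logic and diff against the archived decisions.

```python
def replay_diff(archive: dict, decide) -> list[dict]:
    """Compare archived decisions with what a candidate `decide` would choose.

    `decide` must be pure over the stored node state -- any hidden memory
    would make this comparison meaningless.
    """
    diffs = []
    for step in archive["steps"]:
        new_action = decide(step["node_state"])   # reads the archive only
        if new_action != step["action"]:
            diffs.append({"step": step["id"],
                          "archived": step["action"],
                          "candidate": new_action})
    return diffs

old_archive = {"steps": [
    {"id": 1, "node_state": {"fails": 1}, "action": "retry"},
    {"id": 2, "node_state": {"fails": 3}, "action": "retry"},
]}

# Candidate policy: stop after two failures instead of retrying again.
new_policy = lambda state: "stop" if state["fails"] >= 2 else "retry"
print(replay_diff(old_archive, new_policy))
# [{'step': 2, 'archived': 'retry', 'candidate': 'stop'}]
```

No LLM call, no re-run of the planner: the old archive becomes a regression dataset for new decision logic, which is exactly the "appreciating asset" property described below.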

Unlock

The expensive investment in determinism compounds. Every feature you ship against a deterministic archive works better the longer the archive exists, because the archive itself becomes a dataset you can ask new questions of. In a non-deterministic system, old runs are sunk cost. In a deterministic system, old runs are an asset that appreciates.

The line I had to cross

Committing to determinism as a design requirement is not a natural act for a solo developer. It feels like ceremony. It feels like over-engineering. Every time you are about to add a feature, you have to stop and ask "how does this show up in the archive, what structured data does it need, and how do I guarantee the answer is reproducible." Those questions slow you down. They make features take longer. They produce pull requests that touch the contracts folder, the migrations folder, the tests folder, and the docs folder all at once.

The line I had to cross was accepting that those questions are not friction. They are the product. The value of Precheck is not in any individual feature. It is in the property that every decision is auditable. If I drop that property to ship features faster, I am trading the product's core value proposition for velocity, and the math on that trade is never going to work out. The features will ship faster. The thing the features were supposed to enable will erode. Six months later I will be writing a retrospective about why Precheck stopped being useful, and the answer will be "I stopped paying the determinism tax."

The discipline is to pay the tax every time. Every feature. Every ADR. Every plan. The six questions are the check, and if the answer to any of them gets harder because of a proposed change, the change is wrong.

Back to the arc

The parent article's final line — "the loops are getting tighter, the tools are getting better, and the surface area of what a single practitioner can responsibly build keeps expanding" — has the word "responsibly" doing heavy lifting. Responsibility in AI systems means being able to answer, in detail, why the system did what it did. Determinism as a design requirement is what makes responsibility possible. Without it, the loops are tight and the tools are good, but nobody can answer for the outputs.

The next post steps away from philosophy and into operator experience. When you are running three to five parallel pipelines, you cannot hold the mechanics of each project in your head. The runbook is the product. Phase 6 applied to DevEx. It is a short break before the series closer, which brings the determinism thread back with a specific example — the four-state lesson lifecycle that makes learning auditable.