Field Notes — AI Development Retrospective

The Mech Suit Methodology

How I accidentally reinvented the CLI, learned to feel context limits, and built a multi-agent development studio from scratch. One painful limitation at a time.

This is not a tutorial. It is a practitioner’s retrospective on a few intense months of AI-assisted development: what got built, where it broke, what got thrown away, and what emerged from doing it the wrong way first.

Steppe Integrations is my consulting and freelance practice. Not a research lab. A one-person shop that started with a web browser and a GPT tab and needed to ship real software. What follows is the honest account of how that workflow evolved into something worth trusting for production systems.

The end result is the Mech Suit Methodology. The AI does not replace your judgment. It amplifies it. You are still the pilot. You just move a lot faster.


Phase 01 — The Beginning

One Chat. One Context. One Wall.

The starting point is embarrassingly simple. One GPT web interface, one browser tab. The AI would produce code (sometimes a full file, sometimes a snippet, sometimes a downloadable zip) and those outputs got manually dropped into the local repository. Build. Test. Describe the next problem. Ask for more code. Repeat.

The first real demo was a mobile app. It worked. Everything stayed in a single chat session: no new tabs, no fresh context, just one long accumulating thread.

Workflow Diagram — Phase 1: The Single Thread
Human describes the problem → GPT web chat (single tab, one session) returns code, a snippet, or a zip → manual copy into the local repo → build and test → bugs or the next feature loop back to the top. Manual copy-paste introduces errors, and context bloat grows with every iteration.
Limitation Discovered
With everything in a single chat, context bloat was inevitable. Line-ending conflicts (Windows vs Unix), non-breaking space corruption, and encoding nightmares made snippet-level merges painful. The solution? Ask the model to regenerate the entire file. Fine for tiny projects. A dead end at any real scale.
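Much of that paste corruption can be scrubbed mechanically before a merge. A minimal sketch of the kind of normalizer that would have saved hours, covering the usual offenders from chat output:

```python
def scrub(text: str) -> str:
    """Normalize common paste corruption from chat output:
    CRLF line endings, non-breaking spaces, and smart quotes."""
    return (text.replace("\r\n", "\n")        # Windows vs Unix line endings
                .replace("\u00a0", " ")       # non-breaking spaces
                .replace("\u201c", '"').replace("\u201d", '"')  # smart double quotes
                .replace("\u2018", "'").replace("\u2019", "'")) # smart single quotes
```

Run every snippet through something like this before it touches the repo and an entire class of invisible diff noise disappears.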

The harder limitation was invisible at the time: I could only build things I already knew how to build. The moment I ventured into unfamiliar territory, context drifted. Quality degraded. I could not hold it together. That wall is where the real education started.

Phase 02 — Context Management

Own Your Context. The Web Client Won’t.

The first adaptation was practical: stop trusting the web client to preserve state. Screenshots via Windows+Shift+S went into a local directory for visual reference. More importantly, the AI would generate context markdown files (structured summaries of the current project state, decisions made, and open questions) so a fresh session could pick up from cold without losing the thread.
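Generating a context file is a few lines of scripting. The section names and filename here are my own convention, not a fixed format; any structure works as long as a fresh session can reconstruct the project from it. A minimal sketch:

```python
from datetime import date
from pathlib import Path

def write_context(path: Path, state: str,
                  decisions: list[str], questions: list[str]) -> None:
    """Dump current project state so a cold session can pick up the thread."""
    lines = [f"# Context snapshot ({date.today().isoformat()})", "",
             "## Current state", state, "",
             "## Decisions made"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "## Open questions"]
    lines += [f"- {q}" for q in questions]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# Hypothetical example content, purely for illustration.
write_context(Path("context.md"),
              state="Auth flow works; push notifications stubbed.",
              decisions=["SQLite over Postgres for the MVP"],
              questions=["How should token refresh fail offline?"])
```

Paste the resulting file at the top of a new session and the thread is recoverable.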

It is like calling a consulting agency. They might have your notes and your case file. But you are not guaranteed to get the same architect you worked with last week. Different instance, possibly a different model version, different system prompt tuning. The model does not remember you. Your context lives on your machine, or it lives nowhere.

Workflow Diagram — Phase 2: Context Scaffolding
Human describes → web AI generates code and summaries → code output goes to the repo (build, test) while screenshots and context.md files are captured into a local artifact store. Bugs or the next feature loop back, and a fresh session reloads context from the artifact store.
Insight
The AI does not remember you. You have to remember yourself. Your filesystem is the only guaranteed persistent store. Once that sinks in, every interaction becomes about generating artifacts, not just code.
Phase 03 — The Accidental CLI

Feel the Pain. Solve It Badly. Find the Abstraction.

The copy-paste workflow was killing me. The model clearly knew what code to produce and understood the file structure. There had to be a better way to get it into the repo. So one evening, instead of building the mobile app, a PowerShell script happened. The idea was simple: the AI gives a command, click copy, paste it into the script, which reads the git state and merges the changes locally.

I had a homegrown CLI. I just did not know that was what it was called yet.

After it was working, the obvious thought arrived: “Someone must have solved this already.” A quick search later — oh. CLI tools. That is how they do it. Once a proper CLI replaced the homegrown bridge, the friction dropped dramatically.
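The original bridge was PowerShell; the same idea can be sketched in Python, under the assumption that the model emits a unified diff and that git handles the merge. This is an illustration of the concept, not the actual script:

```python
import subprocess
import sys

def apply_patch(patch: str) -> bool:
    """Apply a model-generated unified diff to the working tree via git.
    All-or-nothing: on failure, git's complaint is printed and nothing merges."""
    result = subprocess.run(["git", "apply"], input=patch,
                            text=True, capture_output=True)
    if result.returncode != 0:
        print(result.stderr.strip(), file=sys.stderr)
    return result.returncode == 0
```

Letting git do the merge instead of hand-pasting snippets is exactly what eliminated the line-ending and encoding pain from phase one.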

Workflow Diagram — Phase 3: The Eureka Moment
Web AI generates commands → copied into the PowerShell CLI bridge (the "ghetto CLI™"), which evolves into a proper CLI with direct repo access → build and test. "Oh. CLI tools exist." Felt the pain, solved it, found the abstraction.
Unlock
I built that PowerShell bridge without knowing that CLI tools for AI development already existed. I discovered the genre by solving the problem myself first — then went looking for prior solutions. That ordering matters: understanding what is being abstracted away at each level is foundational for me to internalize new technology. And once the copy-paste bottleneck disappeared, everything accelerated.
Phase 04 — Multi-Agent Orchestration

Three Agents. Each Doing One Thing.

Once the CLI was working, the next leap was obvious: what about two agents? The roles started to separate. The web client became the co-architect: a place for refining complex prompts, thinking through design, and breaking plans down into what became “sonnets”: small, precise, self-contained implementation chunks. Those sonnets would then get handed off to a cheap model via CLI for execution.

The economics were deliberate. Limited tokens forced discipline: the expensive architect session refined the prompts themselves, while the cheap builder just executed them. Prompt engineering became a distinct phase, separated from implementation.

Then came a third agent. Claude and Codex working in the same repo, each optimized for different strengths: Codex handled architectural questions and research, the web client refined prompts, and the CLI builder executed sonnets. Three agents feeding each other, each one’s output shaping the next one’s input. The human orchestrated the handoffs.
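The sonnet handoff itself is mechanical. Assuming a placeholder builder CLI that takes the prompt as its final argument (swap in whatever tool you actually use), the execution loop is just:

```python
import subprocess
from pathlib import Path

def run_sonnets(plan_dir: Path, builder_cmd: list[str]) -> int:
    """Feed each sonnet file to the builder CLI, in filename order.
    builder_cmd is a placeholder invocation, assumed to accept the
    prompt text as its final argument."""
    count = 0
    for sonnet in sorted(plan_dir.glob("*.md")):
        print(f"executing {sonnet.name} ...")
        subprocess.run(builder_cmd + [sonnet.read_text(encoding="utf-8")],
                       check=True)
        count += 1
    return count
```

Filename ordering (01-, 02-, ...) is what keeps the cheap builder on the rails the expensive session laid down.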

Workflow Diagram — Phase 4: Three-Agent Orchestration
The human orchestrates all handoffs. Codex handles structural Q&A, research, and codebase analysis; its findings feed the web co-architect (Claude, opus / gpt-4), which synthesizes, plans, and refines. Plans land as sonnet chunks in /plans/, and the CLI builder (a cheap model) executes them against the local repo: commits and tests. Expensive refine, cheap execute; each agent is optimized for one cognitive task.
Unlock
Separating prompt engineering from prompt execution is where the leverage lives. One expensive fifteen-minute architect session produces a structured plan that can drive hours of cheap builder work. Different agents for different cognitive tasks. That is the mech suit principle.
The model is always the variable. Once your workflow is dialed in, you are the constant. That is both the power and the responsibility.
Phase 05 — Parallel Pipelines

Stop Being the Bottleneck.

The single orchestration loop was good. Running multiple orchestration loops in parallel was a different category of capability. With compartmentalized web chat sessions, each dedicated to a specific architectural problem with no mixing of concerns, one plan could be in refinement while three CLI workers executed previous plans in parallel.

The biggest unlock was inverting the handoff. Instead of manually feeding sonnets into the CLI one by one, the entire refined plan went into a folder in the repo, each section as its own file. Then one command to the agent: “work through these files.” One expensive planning session could spin off into hours of autonomous execution.

That enabled three to five concurrent builds. Cognitive load shifted entirely to planning and refinement. Execution ran on its own. This is when entirely new territory opened up. Not because the AI was smarter, but because the human-AI loop had enough throughput to actually research and prototype in real time.
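Under the same placeholder-CLI assumption as before, the parallel version is a small worker pool fanned out over plan folders: each plan runs its steps sequentially, while distinct plans run concurrently. A sketch:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_plan(plan_dir: Path, builder_cmd: list[str]) -> str:
    """Execute one plan's step files in order. builder_cmd is a
    placeholder CLI assumed to take the prompt as its final argument."""
    for step in sorted(plan_dir.glob("step-*.md")):
        subprocess.run(builder_cmd + [step.read_text(encoding="utf-8")],
                       check=True)
    return plan_dir.name

def run_all(plans_root: Path, builder_cmd: list[str],
            workers: int = 3) -> list[str]:
    """Fan the plan folders out across a small pool of workers."""
    plans = sorted(p for p in plans_root.iterdir() if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_plan(p, builder_cmd), plans))
```

The one-concern-per-plan rule is what makes this safe: parallel workers only collide if their plans touch the same files.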

Workflow Diagram — Phase 5: Parallel Pipelines
The human, in planning mode, architects and refines → plan files land in /plans/ in the repo (step-01 … step-N) → three CLI builder workers execute plans A, B, and C concurrently → parallel commits land in the repo with tests running.
The New Limit: Your Own Working Memory
Running parallel pipelines surfaces the next constraint fast: tracking all the moving pieces. Context across three concurrent build threads is a lot to hold in your head. That friction drove the next evolution: artifact creation as a discipline, not an afterthought.
Phase 06 — Mature System

The System That Knows Itself.

The mature workflow looks different from the outside. The real changes are about discipline and instrumentation, not just adding more agents.

Single-concern workers. One worker touches one concern per session. No mixing UI changes and server code in the same context window. The moment you think “just one quick UI tweak” while the worker is mid-implementation, you are setting up a context collision that can take hours to unwind. I have reverted entire UIs because I did not respect this rule.

UI mocking as a separate phase. Stitch handles design iteration completely isolated from implementation. Build the mock against personas and product objectives. Lock the design. Then hand it off as a contract to the builder. No semantic drift possible because the builder is not making design decisions. They are executing one.

Agents do their own retrospectives. After three to six hours of committed builds, the agent scans the docs, checks for semantic drift, double-checks naming, updates READMEs, and produces a retrospective on the work. The context is still hot. The quality of that self-review is dramatically better than doing it cold the next day.

Architectural Decision Records. CLI agents update ADRs as they build. Every significant structural choice gets recorded with its reasoning. The next session — or the next agent — has a record of why decisions were made, not just what was built.
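An ADR needs no tooling: a numbered markdown file with a fixed shape is enough for an agent to write and a later session to read. A minimal sketch (the section names follow common ADR practice, not any required format):

```python
from datetime import date
from pathlib import Path

def write_adr(adr_dir: Path, title: str, decision: str, reasoning: str) -> Path:
    """Record the next numbered ADR; builders call this as they work."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    number = len(list(adr_dir.glob("*.md"))) + 1   # next sequence number
    slug = title.lower().replace(" ", "-")
    path = adr_dir / f"{number:04d}-{slug}.md"
    path.write_text(
        f"# ADR {number:04d}: {title}\n\n"
        f"Date: {date.today().isoformat()}\n\n"
        f"## Decision\n{decision}\n\n"
        f"## Reasoning\n{reasoning}\n",
        encoding="utf-8")
    return path
```

The reasoning section is the point: the next agent needs to know why, not just what.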

Workflow Diagram — Phase 6: The Mature System
Personas and product objectives, plus ADRs recording architectural decisions, feed the Stitch UI mock (a locked design contract) and the component architecture (routing, decomposed UI). Single-concern workers execute: a UI worker (React only, no API), an API worker (server only, no UI), and an infra worker (db, Docker, config). At session end the agent runs a self-retrospective, updates ADRs with decisions and reasoning, and refreshes the repo docs (README, semantic scan) for the next session. Everything merges into the shared repository (commits, tests, PRs), one concern per worker.
The Final Unlock
When every step of your workflow produces a deterministic, stateful artifact, you can point an agent at the entire repository and ask: “Tell me about this project.” The ADRs, the component architecture, the personas, the retrospectives — there is enough signal to reconstruct the reasoning behind every decision. The repo can tell you about itself.

Closing Thoughts

The Mech Suit, Not the Pilot.

The throughline across all six phases is this: the AI amplifies your judgment. It does not replace it. In phase one, my judgment was the bottleneck — I could only build what I could already design. In phase six, my judgment is still the bottleneck — but now it operates at dramatically higher leverage, with better tooling, better observability, and better discipline around context hygiene.

A few things I would tell an earlier version of myself:

You will hit the wall at 1,200 to 1,400 lines. Past that point in a single file, context drift is not a risk — it is a guarantee. Design for modularity before you build, not after.

One concern per worker. Always. Mixing UI and server work in a single context window is not a shortcut. It is a debt you will pay in full when you have to revert everything.

Build it wrong first. I built a PowerShell CLI bridge before I knew CLI tools existed. I am glad I did. I understood the problem before I discovered the solution, and that ordering makes a real difference in how deeply you understand both.

The variable is the model. Once your workflow is dialed in, the model is what changes. You are the constant. That means your workflow design — your discipline around context, handoffs, artifact creation — is worth more than any individual model improvement.
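The first of those rules is easy to enforce mechanically. A sketch that scans a repo for files past the wall; the threshold and the extension list are my own rules of thumb, not hard numbers:

```python
from pathlib import Path

def oversized(root: Path, limit: int = 1200,
              exts: tuple[str, ...] = (".py", ".ts", ".tsx")) -> list[tuple[str, int]]:
    """Flag source files past the line count where context drift sets in,
    largest first. Both limit and exts are assumptions to tune per project."""
    hits = []
    for f in root.rglob("*"):
        if f.is_file() and f.suffix in exts:
            with f.open(encoding="utf-8", errors="ignore") as fh:
                n = sum(1 for _ in fh)
            if n > limit:
                hits.append((str(f.relative_to(root)), n))
    return sorted(hits, key=lambda t: -t[1])
```

Run it before handing a file to a worker; anything it flags is a refactor-first candidate, not a build target.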

This is still early days. The loops are getting tighter, the tools are getting better, and the surface area of what a single practitioner can responsibly build keeps expanding. The question worth sitting with is not which AI tools to use. It is how deliberately you are designing the human-AI loop itself.