Agents write at GPU speed, yet we ship at human speed.
Validation is the last bottleneck between autonomous code and autonomous delivery
AI agents can now write code. Actually, a lot of code: SemiAnalysis estimates that 4% of public GitHub commits are now authored by Claude Code, on a trajectory toward 20% by year-end. That part is no longer theoretical. They can take a feature request, inspect a codebase, modify files, run tests, fix their own mistakes, explain their reasoning, and generate something that often looks indistinguishable from what a human engineer would have produced. The industry has moved very quickly from autocomplete to centaur-like implementation, with human + AI working together. The machine is no longer merely suggesting the next line. It is increasingly capable of doing the work.
But there is a difference between writing code and shipping code.
That distinction is where the next major battle in software engineering will be fought. Today, even in companies moving aggressively with AI, generated code still typically flows through human review. The agent writes, the human inspects. The agent proposes, the human approves. The agent accelerates implementation, but the release decision remains human. We have introduced GPU-speed code generation into a software delivery system still gated by human-speed validation.
That is the paradox of the current moment. We have made writing faster without making (value) delivery faster.
Earlier this week, an exchange on Twitter captured this very tension. Reacting to the news that non-technical teams at Coinbase were shipping production code, one user wrote: “genuinely terrifying lol, already had my data leaked enough before this.”
Brian Armstrong replied:
That is a reasonable and very responsible answer - today. But it also reveals the limit of the current model.
If every meaningful AI-generated change still requires a human to inspect, understand, and approve it before release, then the human reviewer remains the throughput constraint. AI may compress the time it takes to produce code, but it does not fully compress the time it takes to trust code. The organization gets faster implementation, but not necessarily faster delivery, and delivery is ultimately what matters. Customers want working software, not just code in a repo.
For most of the last decade, engineering organizations optimized the generation of code. IDE intelligence, then language servers, then copilots, then fully agentic coders. Each generation made writing cheaper, faster, more parallel. Generation is no longer the constraint.
Shipping code did not get faster at the same rate.
Code review, integration, validation, sign-off, deploy — the entire chain that converts a candidate change into a deployed change — still runs at the speed of the slowest human reviewer in it. An agent can produce fifty PRs in minutes. A senior engineer takes an afternoon to review one. The system runs at the rate of its slowest stage, and we just made every other stage a hundred times faster.
AI gave us code at GPU speed. We still ship at human speed.
Why the human review bar held (& why AI will make it obsolete)
There are good reasons human review survived the first wave of AI assistance. Reviewers don’t just check correctness. They check intent — whether the change matches what was actually meant. They check fit — whether it belongs in this codebase, this architecture, this deployment surface. They catch class-of-error issues that tests don’t catch, because tests didn’t know to look for them. And, in fairness, earlier models wrote crap code.
Co-pilot is marginally useful … when writing tests for very “typical” patterns, it saves me an hour once in a while … so I have a key-bind to toggle it on and off, because it’s just obnoxiously useless when it is out of its depth
When the author was another engineer, review worked because both sides shared context. Reviewers calibrated against authors they knew. “Bob’s PRs are usually clean but he forgets edge cases on the auth path.” That mental model is the unwritten substrate of code review.
It breaks the moment the author is a stochastic process.
Agents fail differently than humans do. They produce confident, plausible code that compiles, passes tests, and is wrong in ways human authors usually aren’t — wrong because the spec was ambiguous, wrong because the agent confabulated an API, wrong because it solved a slightly different problem than the one it was given. And wrong because the agent mocked all tests to give a 100% pass rate :) Reviewers cannot calibrate against an author with no consistent personality, no learning across PRs, and infinite stamina.
So reviewers do what risk-averse humans always do under uncertainty. They slow down. Read more carefully. Ask more questions. Throughput collapses. Worse, the reviewer becomes fully utilized as a validator — time spent reviewing is time not spent designing or thinking. Exhausted reviewers waving through a flood of plausible-looking code is not defense in depth. It is a rubber stamp at scale.
There is also something quietly soul-crushing about this future for the human reviewer. Engineers did not enter the profession to audit a machine's output. They became engineers because they liked making things — designing systems, solving problems, seeing their decisions take shape in production. A career spent reviewing an endless stream of agent-generated code is auditing without authorship; the part of the work that drew people into the craft is the part the role removes.
That is why the answer cannot simply be “more review.” More review preserves the old trust model while making the human role worse. The better answer is to move humans out of the repetitive inspection loop and into the work that actually requires judgment: defining intent, setting constraints, shaping architecture, deciding risk tolerance, and improving the validation system itself. The machine can generate the candidate change. The validation layer should produce the evidence. The human should not be the last exhausted barrier between infinite code and production.
So, what’s actually missing?
Validation is missing.
I use the word deliberately. Testing is a subset. Review is a subset. Static analysis is a subset. Validation is the property of the system that lets you trust a change without reading it.
In Preparing for a world where humans don’t write code, I argued that trust in autonomous systems cannot come from inspection. It must come from validation — scenarios, invariants, simulations, and continuous observation of outcomes. That argument has to become operational.
A validation layer good enough to ship agent-written code without a human in the loop is not one thing. It is a layered system, and that system needs four properties.
It captures intent in machine-checkable form. Most engineering orgs encode intent in prose: tickets, design docs, PR descriptions. Prose only validates when a human reads it. Specifications, schemas, properties, contracts, and invariants validate continuously and without supervision. The closer your intent is to executable, the less of a human you need standing between the agent and production. Formal methods spent thirty years as an academic curiosity in industry; maybe they are finally the answer? It's very promising to see startups like Axiom raise significant capital to tackle this problem.
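As a minimal sketch of what executable intent can look like, here is a hypothetical refund policy encoded as invariants rather than as a sentence in a ticket. The `Refund` type and `check_refund_invariants` function are illustrative names, not from any real system:

```python
from dataclasses import dataclass

@dataclass
class Refund:
    amount_cents: int
    original_charge_cents: int
    currency: str
    charge_currency: str

def check_refund_invariants(refund: Refund) -> list[str]:
    """Return the invariants this refund violates; an empty list means it is safe."""
    violations = []
    # Intent: a refund can never exceed the original charge.
    if refund.amount_cents > refund.original_charge_cents:
        violations.append("refund exceeds original charge")
    # Intent: refunds are strictly positive.
    if refund.amount_cents <= 0:
        violations.append("refund amount must be positive")
    # Intent: a refund is issued in the currency of the original charge.
    if refund.currency != refund.charge_currency:
        violations.append("refund currency must match charge currency")
    return violations
```

Nothing here is sophisticated, and that is the point: these few lines of intent run against every agent-generated change, unsupervised, forever. The sentence in the ticket cannot.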
It validates against the problem, not the implementation. Most existing test suites are coupled to how code was written, not what it is supposed to do. Agents will rewrite implementations freely; tests that assume a particular structure will pass for the wrong reasons or fail for trivial ones. Property-based tests, scenario simulations, and outcome assertions survive the agent. Implementation-coupled tests don’t.
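To make the distinction concrete, here is a hedged sketch using the hypothesis property-based testing library for Python. The `apply_discount` function is a stand-in; the test asserts properties of the problem (a discount never raises the price, never pushes it below zero) and would survive any rewrite of the internals:

```python
from hypothesis import given, strategies as st

def apply_discount(price_cents: int, discount_pct: int) -> int:
    """Illustrative implementation; an agent is free to rewrite this."""
    return max(0, price_cents - price_cents * discount_pct // 100)

@given(
    price_cents=st.integers(min_value=0, max_value=10_000_000),
    discount_pct=st.integers(min_value=0, max_value=100),
)
def test_discount_properties(price_cents: int, discount_pct: int) -> None:
    result = apply_discount(price_cents, discount_pct)
    assert result <= price_cents      # a discount never increases the price
    assert result >= 0                # a price never goes negative
    if discount_pct == 0:
        assert result == price_cents  # a 0% discount changes nothing
```

A conventional unit test pinning specific inputs to the outputs of the current implementation would break the moment an agent restructured the code, or worse, keep passing for the wrong reasons.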
It runs continuously, not at a gate. Pull-request review is a single gate at the end of generation. Continuous validation is a loop that runs at every step of the agent’s reasoning, every intermediate artifact, every proposed action. This mirrors the architectural shift I described in AI Agents are here. Security isn’t ready. — static authorization gives way to continuous authorization. Same pattern, different surface. Static review gives way to continuous validation, which is a must-have in a world of stochastic software.
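The shape of that loop, in deliberately simplified Python. Every name here is hypothetical; what matters is that each proposed action is checked against invariants before it is applied, and failures feed back into the agent instead of escalating to a human gate:

```python
# Gate model (today):
#   change = agent.generate(task)
#   if human_review(change): deploy(change)

# Loop model: validate every intermediate artifact and proposed action.
def run_with_continuous_validation(agent, task, validators):
    state = task.initial_state()
    while not state.done:
        proposal = agent.propose_next_action(state)
        # Check the proposal before it is applied, not after the PR is open.
        violations = [v.name for v in validators if not v.check(state, proposal)]
        if violations:
            # Failed validation becomes feedback for the next iteration.
            state = state.with_feedback(proposal, violations)
            continue
        state = state.apply(proposal)
    return state.result()
```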
It produces evidence, not approvals. A reviewer’s “LGTM” is an opinion. A validation system’s output is a record: which properties were checked, against what scenarios, with what coverage, observed in what production conditions afterward. Evidence is what makes autonomy auditable. Opinions don’t scale.
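What such a record might look like, sketched as a Python dataclass with illustrative field names:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ValidationEvidence:
    change_id: str
    properties_checked: list[str]  # which invariants were evaluated
    scenarios_run: int             # how many generated scenarios were exercised
    scenario_failures: int         # and how many of them failed
    coverage_pct: float            # share of changed behavior actually exercised
    canary_error_rate: float       # observed production signal after rollout
    verdict: str                   # "pass" or "fail", derived from the fields above

evidence = ValidationEvidence(
    change_id="chg-4821",
    properties_checked=["refund_never_exceeds_charge", "idempotent_retry"],
    scenarios_run=10_000,
    scenario_failures=0,
    coverage_pct=96.4,
    canary_error_rate=0.0002,
    verdict="pass",
)
# Unlike "LGTM", this record can be queried, aggregated, and audited later.
print(json.dumps(asdict(evidence), indent=2))
```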
Most organizations have fragments of this layer — unit tests, a staging environment, feature flags, a few invariants buried in the code. Few have it as a coherent system. None have it operating at a throughput that matches what agents can already produce.
Software is not the first discipline to face this
When systems become too complex for direct human inspection, trust moves into validation infrastructure. Software is not the first discipline to make this transition.
We do not trust airplanes because one brilliant person reads every line of avionics code before takeoff. We trust them because the industry built layers of specification, simulation, testing, redundancy, certification, telemetry, and incident learning. We do not trust chips because someone manually inspects every transistor. We trust them because design verification and signoff became disciplines of their own — verification engineering is a profession now, with tools, methodologies, and a career ladder distinct from design.
Software has been an outlier. Code review as the primary trust mechanism was always a function of how cheap inspection was relative to how expensive code was. AI inverts both. Code becomes abundant. Inspection becomes the bottleneck. The economics that made review a viable trust layer no longer hold.
What replaces it is what other industries already built: a validation layer that turns trust into a system rather than a ritual.
Brian Armstrong is not wrong. No one should be vibe-coding directly to production today. For most organizations, human review is the only validation layer they have. The point is not that the defense is wrong. The point is that “rigorous human review” is a temporary load-bearing wall, and every engineering leader should know exactly when their organization plans to take it down — and what they’re going to put in its place.
The next era of software will be autonomous. The constraint on that era is not whether models can generate code. They can. The constraint is whether organizations can validate fast enough to ship what gets generated.
Whoever solves validation ships at GPU speed, while everyone else ships at human speed and calls it rigorous review.
Related reading:
- Preparing for a world where humans don’t write code
- AI Agents are here. Security isn’t ready.