ADLC: The Cost to Verify

An agentic system has been running in production for a few months. The executive whose name is on the P&L asks the obvious question: “Is it working?” The answer falls apart in their hands. Metrics are mostly on target, one segment slipped slightly, production looks fine, though monitoring is a day behind. Someone offers the fatal hedge: “Well, it depends what we mean by working.” You can watch the accountable leader give up on getting a clean answer.
That moment is the subject of this piece. Once agents make code generation nearly free, the constraint on a business stops being how fast you can build and becomes how fast you can trust what got built. The number that governs that, the one a leader can act on, is the cost to verify a single change. Most organizations have never measured it.
The number is concrete. It is the time and the cost to carry a single change from the moment an agent writes it to the moment you trust it enough to leave it running in production. Everything else in this piece is downstream of that one number.
Second in a series on the Agentic Development Lifecycle. The first piece argued that once generation gets cheap, the bottleneck moves to verification. This one picks up there.
Here is the part the first piece left implicit. The risk that used to sit in production did not disappear when generation got cheap. It migrated to verification, where almost no one had built the capacity to absorb it.
This realization brought me to the question: If verification is the bottleneck, what does verification mean when a model wrote the code, system behavior is emergent, and the thing you shipped in March, and was valid at the time, is wrong by September with nobody having touched it? What if what we call “verification” is no longer the thing that earns trust?
“Is It Working” is no longer the question.
That scene is not a one-off. It is becoming the standard meeting, and the further agentic development spreads across the enterprise, the more often it plays out the same way. The answer scatters, and the baseline underneath it turns out to be a mess no one can reconstruct on the spot.
The reason nobody wants to answer is that the honest answer is a pile of fragments, and the leader on the hook for the number can tell the fragments don’t add up to a yes or a no. That gap, between the question a business needs answered and the scattered signals the organization can actually produce, is the whole problem. It is not a reporting failure. It is a sign that the thing underneath the question has changed.
My job, mostly, is to build a common baseline other people can work from. It’s kind of my thing.
In a traditional software review that question of “working” had a clean answer. The team validated the build against the requirements, the test suite passed, the acceptance criteria were met. There was even a name for what created a clear finish line: the definition of done, a checklist agreed before anyone wrote a line of code that spelled out what “complete” meant. So “is it working” really just meant “did it pass,” and “did it pass” really just meant “did we hit the definition of done,” and that usually meant “requirements met, expected functionality validated, live in production.” The question was binary because the thing underneath it was deterministic: the behavior was written, not learned. You could write down what correct looked like in advance, agree on where done sat, and then check whether you got there. Red/green.
Agentic and AI systems eliminate the spec. Behavior is an emergent property of data and training, not a line-by-line implementation of stated logic, and the evidence that something is correct comes with a shelf life that is undetermined. The whole apparatus we built to answer, “is it working,” the reviews, the gates, the signoffs, was designed on the quiet assumption that once you had the answer, it would keep. It doesn’t keep anymore, and nothing in the apparatus is designed to handle this operating rhythm.
There is no baseline.
Old-school validation is conformance testing, and for the record it’s a good idea. You write a specification, you derive test cases from it, and validation is the act of proving the system does what the spec said it would. A whole profession is built on that one motion: requirements traceability, acceptance testing, and the standards that draw a tidy line between verification and validation. Barry Boehm’s shorthand still holds in the deterministic world: verification is building the thing right, validation is building the right thing. Both assume there’s a fixed thing to check against in the first place.
In the Agentic Lifecycle, the thing under verification is a model, and a model has no fixed thing to conform to. Its behavior approximates what you wanted, learned statistically from data, characterizable but not enumerable in advance. The closest artifact you have to a specification is an evaluation set, and an evaluation set is a sample, not a definition.
That gap, spec versus eval set, is the entire problem. So let me be explicit. A spec defines what correct is. An eval set gathers up a pile of examples you’ve blessed as correct and hopes they stand in for the rest. The spec covers everything by definition. The eval set covers what’s in it and goes quiet on everything else, which means it tells you how the system did on the cases you thought to collect and nothing about the cases you didn’t. In production, the cases you didn’t think of are most of them.
And here’s the part with no analog in deterministic software: the proof decays without anyone touching the code. Input distributions move, which is data drift. The relationship between inputs and the thing you’re predicting moves, which is concept drift. The environment around the system shifts in ways no one filed a ticket for. Instacart watched this happen in the spring of 2020. Its model for predicting whether an item would be on the shelf ran around 93% accuracy until March, when shopping behavior changed overnight and it dropped to 61%, with nobody having touched the code. The model didn't break. The ground it was standing on did.
There is a name for this. A decade ago, Sculley and his co-authors at Google, writing about the hidden technical debt of machine learning systems, called it CACE: Changing Anything Changes Everything. Touch one input, one dependency, or one assumption, and the behavior of the whole system can move. In deterministic software you can reason about a change in isolation, but in a learned system you can’t, and that single property is why a validation that was true in March is quietly false by September with nobody having touched the code.
Here is the line I keep coming back to. In deterministic software, validation expires when the requirements change. In agentic and ML systems, it expires when the world changes.
And the world does not file a change request.
A lesson from medicine.
There’s an industry that solved a version of this long before software had it, and the structure maps almost exactly.
When a drug goes to market, it has passed rigorous clinical trials. Controlled populations, defined endpoints, real statistical significance. By any reasonable definition it has been validated. And then the regulator does something that would look insane to a traditional software team. It keeps watching. Post-market surveillance and adverse event reporting, for years, on the explicit assumption that the trial population was a sample and the real population has properties the trial couldn’t capture. A drug that performed beautifully in a controlled study can interact badly with conditions or medications that were screened out of the trial. The only way to know is to keep measuring after launch.
Vioxx is the case the field doesn’t forget. Merck’s painkiller was approved in 1999, and it worked: it relieved pain with less stomach bleeding than the older drugs, and tens of millions of people took it. Then the signal showed up in the place the trials were never built to look. Across a population far larger and far more varied than any controlled study, the heart attack and stroke risk became visible, and in 2004 the drug was pulled from the market. It had passed every gate that existed at launch. Five years of real-world use, not the trial, is what surfaced the harm. The gate that would have caught it earlier, sustained surveillance of the broad population, existed, but it was not pointed at the right thing soon enough.
The management lesson is blunt: budget for post-market surveillance, not just the launch. The trial proves the drug works on the population you could study. The monitoring proves it keeps working on the population you actually have. An organization that funds the first and treats the second as optional is buying the cheap half of safety and calling it done. The same arithmetic applies to every agentic system a business puts into production.
That is the posture agentic systems require, and almost no enterprise is built for it. We treat the launch as the validation and move on. Medicine learned, through real harm, that approval is the cheap part. The expensive part is everything after it, the unglamorous monitoring that runs for years and never produces a headline. “Did it pass” and “is it still working” are not the same question, and the second one is the one we keep dropping.
But the analogy breaks in one place, and the break is in software’s favor. A drug cannot be recalled from a million bloodstreams; once the harm is in the population, the damage is done and the cleanup takes years. Software does not work that way, because a bad change can be rolled back in seconds, and that changes the lesson entirely. Medicine’s surveillance has to be slow and cautious precisely because reversal is impossible, which puts the entire burden on watching. Software can run the opposite posture, shipping faster because it can also reverse faster, which means the discipline worth building is detect-and-revert rather than delay-and-deliberate. The leader’s job is not to slow shipping down to medicine’s pace. It is to make sure that when monitoring catches something, reversing it is a button and not a project. This is not hypothetical. In mid-2025 an AI coding agent at Replit wiped a user's production database during an explicit code freeze, then reported that recovery was impossible. The rollback worked anyway. The agent misbehaving was never the interesting part, because it always will. The story ends well only because someone had built a reversal path the agent could not talk them out of.
Agentic-era verification: not “more tests.”
Here’s where agents change the shape of the problem, not just the speed of it. And it’s the part most teams get backwards.
When a human wrote the code, reading the code was a reasonable way to trust it. Slow, but reasonable. When an agent writes it, and writes it faster than anyone can keep up with, reading stops being a plan and turns into a bottleneck with good intentions. So here’s the definition I find useful: verification is whatever lets us trust the code is correct without sitting down and reading it. If a check still needs a person to scan the diff, that’s review, not verification, and review is the exact thing that broke the moment the agent started outrunning the reader.
The trap most teams fall into is answering “we need more verification” with “we’ll have the agent write more tests.” An agent that writes both the code and the tests can produce a beautifully green test suite by construction. It’s grading its own homework. The tests pass because they were written to pass, and you’ve bought yourself confidence that correlates with nothing. Real verification encodes intent independently of the implementation, the way an auditor reconciles against a separate set of books rather than re-reading the bookkeeper’s entries.
In practice that means a stack of layered filters, each one independent of the code under test, each one more expensive than the last. The point is to catch every failure at the cheapest layer that can catch it, and to make the expensive layers a last resort rather than a first line:
Types and contracts. Encode the domain invariants so the illegal states can’t be represented in the first place. A sufficiently expressive type system catches whole categories of error before the agent ever opens a pull request. This is the cheapest place to catch a defect and the one teams most often underbuild, which is backwards: a failure caught here costs almost nothing, and the same failure caught in production costs a customer.
Property-based testing. Instead of checking specific input-output pairs, which an agent can memorize and satisfy, you assert the general properties that must hold across a large range of generated inputs. Outputs stay sorted. Operations remain reversible. A balance never goes negative. Properties capture intent in a way example tests don’t, and they surface the edge cases no one thought to write down, exactly where the expensive failures live. Teams using property-based testing regularly report catching subtle violations (for example, an edit-distance function that broke the triangle inequality, or serialization logic that corrupted edge cases like -0.0 or empty structures) that traditional example tests missed entirely.
Contract and integration tests at the seams. Most real failures happen at boundaries, where one component’s assumptions meet another’s. Machine-checkable contracts at those boundaries mean a change on one side immediately lights up the inconsistency on the other, without a human holding both sides in their head. For the business, that is the difference between catching an integration break in minutes and discovering it as an outage after two teams have shipped against incompatible assumptions.
Production verification and observability. The highest-fidelity signal comes from real outcomes on real data. Detailed logging, metrics tied to business results rather than system health alone, alerting on symptoms, and a rollback path you can pull fast. As velocity climbs, how quickly you can detect and reverse a bad change becomes part of verification itself, not an operations footnote. This is the layer a leader feels directly: it is the gap between a bad release that costs an hour and one that costs a quarter’s reputation.
The thread through all four is that none of them depend on a person reading the code, and each one catches a class of failure earlier, and therefore cheaper, than a human scanning a diff ever would. Build the stack and the cost to verify a change falls, throughput rises, and the people you have are freed for the judgment calls a machine can’t make.
The human in the loop.
None of this gets rid of people. It changes where they’re worth putting.
Uniform review of every change made sense when changes arrived at human writing speed. In the Agentic Development Lifecycle, attention has to route by risk instead of by habit. Core business logic, irreversible actions, money movement, anything at a high-stakes boundary: that gets deep human scrutiny, every time. A well-covered, easily reversible change to a low-blast-radius corner can ride on machine verification and ship. Treating those two cases the same way, which is what “review everything” quietly does, spends your scarcest resource on the changes that need it least.
Governance has to make the same move, and this is the harder sell, because governance bodies are built around a finish line. The fix is not to disband the committee. It is to change what the committee meets about. A phase gate that approves a system once and files the matter closed is watching the wrong thing, since the system it blessed is already drifting away from the version that got blessed. So redesign the cadence. The standing question is no longer “did it pass,” asked once at launch. It is “is it still passing, and how long before we’d know if it stopped,” asked on a schedule for as long as the system is making decisions that matter. That turns the review board from a body that signs off and disbands into one that owns an ongoing answer to “is it working,” which is the question the business was asking all along. The board does not produce that answer by meeting. The instrumentation produces it. The board’s job is to read the instrument and act when it moves.
The number on the wall.
If there’s one metric I’d put on the wall, it’s the cost to verify a single change. Almost no one tracks it, and in the agentic era it’s the number that governs everything.
The logic is unforgiving. When generation is cheap and verification is expensive, everything the agent produces piles up behind the verification step, because that’s the part that didn’t get faster. You sped up the half that was never really the constraint and left the real one sitting right where it was. The teams getting real return out of agents are the ones grinding the cost to verify down, because that’s the valve that controls how much ever ships. Make generation cheaper without touching verification and you haven’t bought speed, you’ve just grown the line waiting to be checked.
What moves the number is where the failure gets caught. A defect stopped by a type costs almost nothing. The same defect found in production costs a customer, an incident, and the hours of three people reconstructing what happened. So the number falls when you push checks left, toward the cheap end of the stack, and it falls again when the failures that do reach production can be reversed in seconds instead of triaged for a week. Catch it early, and when you miss, reverse it fast. That is the whole game.
This quietly rewrites what a strong engineer is worth. The skill that’s appreciating isn’t typing speed or fluent syntax, both of which the agent now does for free. It’s the ability to state correctness as invariants and properties, to design systems that are observable and checkable by construction, and to make a clean-eyed call about which changes carry real risk. Deep systems thinking, the kind that holds the architecture and the failure modes in view at the same time, is worth more in this world, not less. The engineers whose main contribution was producing implementation quickly are the ones this shift pressures, and that’s worth saying plainly rather than dressing up. The work is moving from writing the code to defining what correct means and proving the code meets it.
The limits of verification.
I want to be honest about the limits, because a piece like this can read as if verification is a machine you switch on so you can stop thinking.
It isn’t. Verification tells you the system does what you specified. It cannot tell you whether you specified the right thing. Whether the feature is worth building, whether a requirement nobody wrote down is about to matter, whether there’s an ethical edge to this that no property test will ever catch, whether the high-level architectural bet is sound: those stay human. The point of the verification stack isn’t to remove people. It’s to shrink the set of questions that require a person down to the ones that genuinely do, and then make sure a person is looking at those.
And building the stack is real investment. Types, contracts, property tests, observability: the teams that treated those as core infrastructure are about to look prescient, and the ones that filed them under “overhead, we’ll get to it” are going to hit their failures at the same speed the agents gave them on everything else. Going faster doesn’t sort the good changes from the bad ones. It just gets you to both of them sooner, and if nothing independent is doing the checking, the place you find out about the bad ones is production.
Verification also has its own cost. A check that runs forever and never catches anything is not safety, it is overhead, and it pushes the cost to verify back up, which is the one number you were trying to bring down. The goal is not the longest possible list of gates. It is the stack that earns its weight: each layer there because it catches a class of failure that would otherwise reach production, and the ones that don’t earn their keep cut without ceremony.
Somewhere in most of these organizations is a person who already knows this. An engineer watching a metric soften week over week, or staring at a flood of agent-generated diffs they have no realistic way to read, who can feel that the old way of trusting code has quietly stopped working. They usually have nowhere to take it, because the governance forum is organized around approvals and the review process assumes a human can keep up. Giving that person somewhere to take it, a stack that does the reading they no longer can and a cadence that hears “it’s drifting” as a real event, is most of the job.
In closing.
The SDLC was a strong model for the world it was built for: fixed specs, deterministic behavior, stable ground underfoot. Agentic and AI systems don’t offer any of those. The behavior emerges, the data won’t hold still, and the proof decays on its own schedule. Running a deterministic assurance model over that reality doesn’t make the risk go away. It just hides it until something drags it into the open.
The organizations that come out of this ahead won’t be the ones generating the most code, because generating code is the part that got cheap. They’ll be the ones who treated verification as a layered, independent, machine-checkable discipline, and validation as something you keep earning from production signals for as long as the system is in use.
There's no done. I said that last time and I meant it. The agentic version is sharper: there's no validated-and-walk-away, and there's no verified-by-reading-it anymore either. It is closer to walking a fence line. The fence passed the day it went up, the ground has been moving ever since, and the only way to know it still holds is to go walk it again. Nobody files a ticket when a post rots. Trust has to be built into the system and maintained for as long as it's making decisions that matter. The teams that learn to specify intent precisely and verify it independently are the ones who will define what comes after the SDLC.
Sources: IEEE Std 1012, Standard for System, Software, and Hardware Verification and Validation. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015.


