Before You Replace Your Manual Testers, Read This
How development managers should think about the mix of AI and manual testing.
The first testing engagement I led was in 2009, at the Social Security Administration. The process was straightforward enough. I’d write a test case in a document, validate the test with the engineer, open HP’s QuickTest Pro, walk through the steps, narrate what I was doing and what I expected to happen, and save the recording as evidence that the test had been performed and had passed or failed. If the test needed to be re-run after a code change, I re-recorded the whole thing. Every time, for every test, for an entire year before the contract was done and I ran away as fast as I could.
That was the state of the art in a large federal agency sixteen years ago. I remember thinking, even then, that there had to be a better way. I was right, but the better way took longer to arrive than I would have guessed, and the organizations that were still doing some version of that work in 2020 were not outliers. They were the norm. The truth is, there are just as many organizations running this way in 2026 as there were in 2020, and probably not many fewer than there were in 2009. If I had to guess, the SSA isn’t much better off nearly 20 years later.
I bring this up because the current AI-for-testing conversation tends to skip past the long arc. The conversation I have been having with clients is usually something like “we have an AI mandate from the board, and the CIO thinks we need to start with the manual testing.” What many dev managers are thinking right now is some version of: point the model at the codebase, let it generate tests, watch your manual QA footprint shrink. Too easy, right?
So, here’s the honest version of the problem. The technology is further along than most people give it credit for, even if your organization and dev infrastructure are not. Model-generated unit tests, AI-assisted exploratory testing, self-healing selectors, natural-language test authoring, flake triage, visual regression that doesn’t fall over when a padding value changes by two pixels. All of it works well enough to move numbers in any program. I’ve watched teams cut their manual regression footprint by meaningful percentages inside a single quarter with tooling that was considered experimental eighteen months ago.
And yet most of the dev managers I talk to are sitting in the same spot. They know the capability is available, they know the ask from leadership is well-founded, and they still can’t find a path forward that feels right. Every time they map the AI capability onto the existing testing practice, one of two things happens. Either (1) they end up with a proposal that looks like replacing the QA team, which they don’t actually want to do and can’t defend politically. Or (2) they end up with a proposal that looks like adding AI tools on top of the existing QA team, which doesn’t change the economics enough to show meaningful ROI.
Getting It Right
A few years ago my team started working with a large logistics company based in Memphis. When we came in, their manual regression process was so heavy that they could only safely release new code about every three years. Three years. For a business whose operations were increasingly dependent on the software keeping up with what the warehouses and the planes and the drivers were doing. The testing wasn’t the only reason for the release cadence, but it was the load-bearing one. Every release required a full manual regression pass, the pass took months, and by the time it was done, the next batch of work had accumulated into another multi-year cycle.
A three-year release cadence wasn’t a testing problem. It was a business problem that manifested as a testing problem.
We’ve been working with them for a while now, and the cadence has come down dramatically. Some teams can even release multiple times a day. Other parts release monthly or weekly depending on risk profile. The manual regression suite still exists, but it’s targeted at the parts of the system where the cost of getting it wrong justifies the time, and everything else is handled through automated pipelines with AI-assisted test generation, triage, and coverage analysis layered in.
The point of that story isn’t the tooling. It’s the compounding cost of not modernizing. Three years between releases wasn’t a testing problem. It was a business problem that manifested as a testing problem, and it took a serious investment in engineering practice, CI/CD discipline, test data management, deliberate reduction of technical and organizational debt, and culture change to unwind. AI was eventually part of the answer, but it was never the whole answer, and it wasn’t even close to being the first thing we touched.
The Art of Manual Testing
Before you can decide what to automate, you have to be honest about what your manual testers are doing, because the job title “manual tester” covers at least four different jobs.
The first job is executing predictable test scripts. Somebody wrote the script, the tester runs through it, the tester files the defect. This was exactly what I was doing in 2009, and it’s the job most vulnerable to AI. It’s deterministic, it’s repetitive, it’s the thing your QA lead has been trying to automate for five+ years anyway. If this is a large chunk of your manual footprint, you have a ton of opportunity.
The second job is exploratory testing. The tester gets a feature, forms a mental model of what could go wrong, and goes looking for it. This job is not primarily about executing a script. It’s about knowing where to push. AI can support this work, and is starting to generate plausible exploratory scenarios, but the thing that makes a good exploratory tester good is the accumulated judgment about this specific product, this specific customer base, and the specific ways users get themselves into trouble. Replace it with a generic model and you lose the edge cases that only get caught because someone remembers the 2023 incident with the export tool.
The third job is environmental and integration testing. The person who knows that the staging environment’s cache behaves differently from production, that the nightly ETL needs to finish before the API tests run, that the third-party payments sandbox will sometimes just stop working on Thursdays. This is knowledge of the seams, not the code. AI can be taught some of it, given enough observability data, but the person currently doing it is doing it partly through pattern recognition and partly through relationships with the people who run those systems.
The fourth job is the quality conscience of the team. The tester who pushes back in sprint planning. The one who asks “are we sure this is a good idea” before the architectural decision gets locked in. Every team I’ve worked with that had a healthy quality posture had at least one person playing this role, and none of those teams called it out as a role. It just happened.
If you walk into the AI-for-testing conversation without naming which of these four jobs you’re trying to automate, you will end up optimizing the first one, congratulating yourself, and losing ground on the other three without noticing.
This is a modernization failure mode that shows up across every category of work, not just testing. The measurable parts get better. The load-bearing but invisible parts erode.
Where AI Wins
I’ll name the capabilities I’ve seen move the needle in real engagements, with the honest caveat that the list is moving fast enough that any blog post is a lagging indicator.
Generated unit tests for new code, written alongside the developer as they go, are now good enough that most teams should be using them. Coverage goes up, the dev is not pulled out of flow to write them, and the tests themselves are usually fine. The failure mode is that generated tests drift toward testing the implementation rather than the behavior, so you get brittleness if you don’t review with an eye for it.
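To make that failure mode concrete, here’s a toy illustration in Python. The `apply_discount` function and both tests are invented for this example, not pulled from any tool’s output; the point is the shape of the two tests, not the function itself.

```python
# Hypothetical function under test (invented for illustration).
def apply_discount(price: float, code: str) -> float:
    """Return the discounted price; unknown codes leave the price unchanged."""
    rates = {"SAVE10": 0.10, "SAVE20": 0.20}
    return round(price * (1 - rates.get(code, 0.0)), 2)

# Implementation-coupled test: it restates the internal formula, so a
# refactor of the rate lookup breaks it even when behavior is unchanged.
# Generated tests often drift toward this shape.
def test_discount_implementation():
    assert apply_discount(100.0, "SAVE10") == round(100.0 * (1 - 0.10), 2)

# Behavioral test: it states the contract a caller actually relies on,
# including what happens on the unhappy path.
def test_discount_behavior():
    assert apply_discount(100.0, "SAVE10") == 90.0
    assert apply_discount(100.0, "BOGUS") == 100.0  # unknown codes are a no-op

test_discount_implementation()
test_discount_behavior()
```

The first test passes today and tells you nothing a refactor won’t invalidate; the second survives refactoring and fails only when behavior changes. Reviewing generated tests with that distinction in mind is the cheap insurance.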
Generated integration and end-to-end test scaffolding, from a feature description or a user story, is getting close to useful. What the model writes is usually 70 to 85 percent of what you want, and the last bit is fixing the selectors, the timing, and the assertions that only a human familiar with the actual UI will catch. Net time savings are real, but the handoff still matters.
Flaky test triage is one of the clearest wins. Large codebases accumulate flakes the way old houses accumulate small leaks, and manually classifying which ones are product bugs, which ones are environment issues, and which ones are bad tests is a terrible use of senior QA time. AI does this well, and the payoff in signal quality is larger than the payoff in hours.
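For a feel of what the triage automates, here’s a deliberately crude sketch of the rerun-history heuristic most flake detectors start from, before any model-based log analysis gets layered on top. The data shape and category names are invented for illustration.

```python
from collections import Counter

def classify(runs: list[dict]) -> str:
    """Crude triage heuristic over recent runs of a single test.

    Each run is {"result": "pass" | "fail", "commit": str}.
    Mixed results on the SAME commit suggest a flake; consistent
    failure across commits suggests a real product or test bug.
    """
    by_commit: dict[str, Counter] = {}
    for r in runs:
        by_commit.setdefault(r["commit"], Counter())[r["result"]] += 1
    # Same code, different outcomes: the classic flake signature.
    if any(c["pass"] and c["fail"] for c in by_commit.values()):
        return "likely-flake"
    if runs and all(r["result"] == "fail" for r in runs):
        return "consistent-failure"
    return "healthy"

# A test that passed and failed on the same commit is flagged as a flake.
print(classify([{"result": "pass", "commit": "a1"},
                {"result": "fail", "commit": "a1"}]))  # likely-flake
```

The AI tooling earns its keep above this baseline, by reading stack traces and environment logs to separate “bad test” from “bad environment.” But even the baseline shows why hand-classifying this is a waste of senior QA time.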
Visual regression, self-healing locators, and log-based defect clustering have all quietly crossed the threshold from novelty to default. If your current testing stack doesn’t include some version of these, the AI conversation is secondary. Get the fundamentals modern first.
Natural-language test authoring, where a product manager or business analyst can describe a scenario and get a runnable test, sounds further off than it is. I’d caution against rolling it out broadly before the engineering org has a clear opinion on test ownership, because it tends to create tests that are hard to maintain and nobody wants to own.
AI-assisted exploratory testing, where the model proposes edge cases based on the codebase and the requirements, is genuinely useful as a second opinion. It is not a replacement for the human doing the work. It’s a way to stretch what the human covers.
The Bottleneck
I’ve worked on QA practice across a lot of different industries and organizations, and one thing shows up in nearly every engagement: test data. The quality of the test data, the availability of the test data, the speed at which realistic test data can be provisioned into a non-production environment, and the degree to which that data can be trusted to reflect production conditions without leaking sensitive information. This is the bottleneck.
It’s a bottleneck that doesn’t get included in the AI-for-testing pitch, because test data is unglamorous, and because the platforms selling AI test generation mostly assume you already have good data to test against. Most organizations don’t. They have stale snapshots from months ago, partially masked in ways that break referential integrity, sitting in an environment that was last refreshed when somebody cared enough to do it manually.
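Referential integrity breaks when masking maps the same key to different values in different tables. The standard fix is deterministic pseudonymization: hash each identifier with a secret salt, so the same input always masks to the same output and foreign keys still join. A minimal sketch, with invented table shapes; in practice the salt lives in a secrets manager and rotates per environment.

```python
import hashlib

SALT = b"rotate-me-per-environment"  # illustrative; store as a real secret

def mask_id(value: str) -> str:
    """Deterministically pseudonymize an identifier: same input, same output."""
    digest = hashlib.sha256(SALT + value.encode()).hexdigest()
    return f"cust_{digest[:12]}"

customers = [{"customer_id": "C-1001", "name": "Ada Lovelace"}]
orders = [{"order_id": "O-9", "customer_id": "C-1001"}]

masked_customers = [
    {**c, "customer_id": mask_id(c["customer_id"]), "name": "REDACTED"}
    for c in customers
]
masked_orders = [{**o, "customer_id": mask_id(o["customer_id"])} for o in orders]

# The foreign key still joins after masking, because the mapping is stable.
assert masked_orders[0]["customer_id"] == masked_customers[0]["customer_id"]
```

Masking that randomizes each occurrence independently is what produces the broken snapshots described above: every join across tables silently stops matching.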
What this means in practice is that you can generate all the tests you want with AI, and your test pass rate is still going to be dominated by environmental and data issues rather than by actual product behavior. The signal gets drowned out. The QA team spends its time doing forensics on why a test failed, not on whether the product is good.
The encouraging development is that AI is genuinely useful on this problem, and in my view the test data angle is where the next wave of real value is going to come from. Synthetic data generation that preserves the statistical shape and referential relationships of production. Privacy-preserving data synthesis that lets you test realistic scenarios without carrying PII into lower environments. Context-aware subsetting that pulls the slice of production state relevant to the feature you’re testing. Automated drift detection between production and test environments so you find out before your test run that the schema has moved. These are not hypothetical. They are in reasonable shape today and improving quickly.
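A drift check doesn’t need to be fancy to be useful. Here’s a minimal sketch of a column-level schema comparison; the schema maps are invented for illustration, and real tooling would pull them from `information_schema` or the warehouse catalog rather than literals.

```python
def schema_drift(prod: dict, test: dict) -> dict:
    """Compare two {column: type} maps and report where they diverge."""
    shared = set(prod) & set(test)
    return {
        "missing_in_test": sorted(set(prod) - set(test)),
        "extra_in_test": sorted(set(test) - set(prod)),
        "type_mismatches": sorted(c for c in shared if prod[c] != test[c]),
    }

prod_schema = {"id": "bigint", "email": "text", "created_at": "timestamptz"}
test_schema = {"id": "bigint", "email": "varchar(255)"}

report = schema_drift(prod_schema, test_schema)
# report flags created_at as missing in test and email as a type mismatch
```

Run something like this on a schedule and fail the pipeline before the test run starts, and a whole class of “forensics on why a test failed” disappears.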
If I were running a dev org and I had budget for one AI-in-testing initiative this year, and my test data hygiene was typical, I would spend it on test data before I spent it on test generation.
The generation problem is mostly solved. The data problem is the multiplier, and fixing it makes everything else you do downstream better.
A Note on CI/CD
I’ve spent a good amount of time over the last several years working to advance the body of knowledge around CI/CD practices in large enterprises. Contributing to how frameworks think about continuous delivery, what mature CI/CD looks like at scale, and how engineering practices connect to portfolio-level outcomes. Some of that work has shown up in SAFe guidance. Some of it has shown up in the way we serve clients at TSG.
One thing I’ll say from that work, which applies directly to the dev manager reading this post: AI-in-testing is a CI/CD conversation, not a QA tools conversation. The organizations that get the most value from AI-assisted testing are the ones that already have a functioning delivery pipeline. Trunk-based development or something close to it. Operationalized code review practices. Environment parity that isn’t a pipe dream. Feature toggles that are actually used in practice. Observability that lets you catch problems fast when they escape into production. When those fundamentals are in place, AI in the testing layer compounds quality. When they aren’t, AI in the testing layer is a faster way to produce junk, or as my friend Steve Adolph says, “garbage in, landfill out.”
This is why I’m cautious when I see organizations treating AI-for-testing as a discrete initiative, run out of the QA budget, owned by a QA leader. It isn’t a QA initiative, but a delivery modernization initiative that happens to show up most visibly in the testing layer. Framing it that way changes who’s at the table, what the success metrics are, and whether the investment actually produces the economics leadership is expecting.
How It Lands
If you take those four manual jobs and those six AI capabilities and put them in a room, the mix that comes out is not “replace manual testing with AI.” It’s not “keep everything and layer AI on top,” either. It’s something messier and more interesting.
The first manual job, script execution, should largely move to AI-assisted automation over the next twelve to eighteen months if it hasn’t already. This is the part of manual testing that leadership is right to want gone. The work was never the highest use of the people doing it, and it was never a good use of mine in 2009 either.
The second job, exploratory testing, should become a hybrid. Your best exploratory testers get AI tooling that proposes scenarios and runs the routine passes, and their time shifts toward the hard cases, the new features, and the pattern recognition work that the model can’t do. Fewer people doing more valuable work. You’ll likely need fewer exploratory testers than you have today, but the ones who remain should be more senior and better paid.
The third job, environmental and integration testing, needs to be reframed entirely. The manual tester in this role was often carrying knowledge that should have lived in your observability platform, your runbooks, and your platform engineering team. If you’re serious about AI for testing, the upstream investment is in making that environmental knowledge legible. Tests that pass in a broken environment are worse than no tests at all, and AI-generated tests are particularly vulnerable to this because they don’t have the tacit sense of “something’s off today” that a human operator develops.
The fourth job, quality conscience, does not get automated. It gets renamed, elevated, and kept. The best version of what comes out the other end of a thoughtful AI-for-testing program is that you still have that person, they’re now a quality engineer or a senior SDET or whatever your org calls it, and the ratio of their time spent on judgment versus execution has gone up substantially.
Going Live
A few notes for the dev manager who has to run this.
Start with what you can measure honestly. The cost of your current testing practice is usually understated, because so much of the work is implicit and never shows up in a plan. Before you commit to a target, spend a couple of weeks with your QA leads getting an honest picture of where the hours go. You will be surprised. Almost everyone is.
Don’t set the adoption target in hours saved or headcount reduced. Set it in defect escape rate, time-to-feedback on a pull request, release cadence for the parts of the system that should be releasing frequently, and coverage on the parts of the codebase that actually change. Hours saved is a lagging, politicized metric. The others are leading indicators of whether the program is working. The Memphis logistics company we worked with didn’t move from three-year releases to daily ones by targeting hours saved. They targeted cycle time, and the hours came with it.
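Defect escape rate, for what it’s worth, is simple arithmetic once you agree on the counting window. A minimal sketch, with invented counts; the hard part in practice is the classification discipline, not the math.

```python
def defect_escape_rate(caught_pre_release: int, found_in_production: int) -> float:
    """Fraction of a period's defects that escaped to production."""
    total = caught_pre_release + found_in_production
    return found_in_production / total if total else 0.0

# Example: 45 defects caught before release, 5 found in production -> 10%.
rate = defect_escape_rate(45, 5)
```

Track it per release train rather than per team, or it becomes a blame metric instead of a program metric.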
Treat the manual testers you have as the most valuable input to the program’s design. The person who has been running the regression suite for four years knows more about where the bugs hide than any model will know for the next several years. Bring them into the design of the AI-assisted replacement, and pay them as the senior engineers they will need to become. The organizations that quietly win this transition are the ones that retain and retrain. The ones that treat it as a staffing exercise lose the knowledge and then spend three years rediscovering it through production incidents.
Fix test data before you scale test generation. I cannot say this loudly enough. If your lower environments are unreliable, AI-assisted testing will amplify the unreliability, not resolve it.
Build the rollout in phases that assume the first version of the AI-assisted practice will be wrong in ways you can’t predict. Generated tests will occasionally be confidently wrong. Self-healing locators will occasionally heal themselves into testing the wrong thing. Flake triage will occasionally mark a real bug as a flake. You need the feedback loops to catch these, and you need them before the program is declared complete, not after.
And accept that “done” is the wrong frame. The tooling will keep changing. What your model is good at in the spring isn’t what it’ll be good at in the fall. The dev managers who get the most from this are the ones who build a practice of reassessing the mix every couple of quarters, rather than running a single transformation and declaring victory.
Closing Thoughts
The manual testers in your org are not the problem AI is solving. The problem is that testing in most organizations has been under-invested, under-staffed, and under-respected for a long time, and the recent AI capability gives you cover to fix that. Not by removing the humans. By moving the work they do up the stack, automating the parts that were never a good use of them, fixing the test data infrastructure that has been quietly taxing every program for a decade, and paying for the judgment that’s left.
I think about the 2009 version of me, recording those screen captures of my own test runs, and I don’t romanticize any of it. The work was tedious and the feedback loops were terrible and the profession deserved better tooling. What the 2009 version of me had, though, was a slowly accumulating sense of where the system was going to break, built from the hours of sitting with the product. That’s the thing I’d want to protect as you run your program. Not the tedium. The judgment that grew out of it.
The dev managers who get this right won’t be the ones with the most aggressive automation targets. They’ll be the ones whose quality posture, twelve months from now, is obviously stronger than it was before the program started, and whose best testers are still on the team.
That’s a harder program to sell in a single slide, and it’s also the mix that works.


