AI for Pentesting: Strengths, Weaknesses, and Where XBOW Fills the Gaps
AI is transforming pentesting, but it has clear strengths, like pattern matching, and clear weaknesses, like orchestration. XBOW leverages the strengths and scaffolds the weaknesses to create an enterprise-ready autonomous offensive security platform.
Can LLMs transform pentesting? Yes. But not in isolation. It takes more than a model to make AI pentesting truly enterprise-ready.
There are areas in pentesting where AI excels, and excels independently. There are other areas where AI works well, but needs a bit of scaffolding. And finally, there are areas where it struggles, and needs a lot of external structure. And we have found that this is true regardless of the model; the strengths and weaknesses are uniform across them all. At XBOW, we leverage the strengths of AI agents while mitigating their weaknesses to create an autonomous offensive security solution that is not only effective, but safe, reliable, and scalable.

Pentesting tasks AI is good at
Three pentesting tasks where AI excels with very little human intervention are payload crafting, pattern detection, and report writing.
AI for payload crafting
This part of pentesting is where AI really shines. It excels at generating exploits, bypassing filters, and crafting injections, and, even more importantly, at course-correcting based on feedback and fine-tuning a payload until it reaches its goal. Its prowess in this area was evident in the recent AI-driven GTG-1002 attack, where agents were observed iterating to find the perfect payload.
AI for pattern detection
This is another AI superpower. It can quickly and accurately scour vast volumes of output, recognize signs of known vulnerabilities, and identify attack surfaces. Pattern detection, which requires scanning pages and pages of HTML, source code, screenshots, and more, is an ideal task for AI, which doesn’t get bored or tired and is very good at detecting subtle differences.
AI for report writing
This is the part of the process every pentester hates: documenting findings, explaining impacts, detailing remediation. It is also the part, with its extensive need for summarizing, where AI can do most, if not all, of the heavy lifting.
Pentesting tasks where AI needs structure
Pentesting activities where AI on its own is weak or needs more structure and guidance are planning, strategizing, and validation. It also, importantly, needs safety guardrails to stay on track and in scope.
AI for pentesting strategy and planning
The strategy and planning parts of pentesting are challenging for naked LLMs because of their single-minded focus. It’s nearly impossible to conduct a pentest with a single AI agent. You need a fleet of agents, and you need to wrap a solid layer of governance, orchestration, and validation around them.
A fleet of AI agents
Pentesting needs a fleet, though you could, hypothetically, conduct an actual attack with a single AI agent: give it a goal and it focuses on that one task; if it doesn’t succeed in getting access, it moves on to another target. But pentesting doesn’t work this way. If one attack fails, the test needs to continue with alternative attacks on the same target, then the targets next to it, and the targets next to those, leaving no part of the attack surface untested.
An attack is like the point of a sword, opening up a single path, while the defense has to be like a shield and cover it all.
We saw an example of this in the GTG-1002 attack. At the height of the attack, the AI agent was submitting several thousand requests, sometimes more than one per second. XBOW’s AI-led autonomous pentesting solution conducts attacks at a cadence several orders of magnitude beyond that, because it has to. It has to cover more, make more requests, approach the target from more angles.
A single agent could never conduct that volume of genuine, creative attacks on its own, both because of time constraints and because, over time, it would accumulate wrong assumptions and misinterpreted responses, build on them, and become ineffective.
Orchestration in AI pentesting
To create this “shield,” XBOW uses a coordinated fleet of agents to conduct pentesting. But coordination is no small feat when agents overlap, contradict each other, and even compete with one another. Our CRO Niroshan Rajadurai described this phenomenon in a recent article: “The same surface gets covered again and again because nobody told agent 3,000 that agent 400 already went there.” AI pentesting without coordination is chaos. XBOW therefore employs short-lived agents, plus coordination agents that maintain a global view of the environment, identify the attack surface, and continuously direct testing. The coordination agents also apply deterministic logic to debrief agents, refine findings, and prioritize next actions.
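To make that concrete, here is a minimal sketch of the coordination loop in Python. It is illustrative only, not XBOW’s implementation; every name in it (CoordinationAgent, WorkerAgent, the hard-coded technique list) is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    endpoint: str
    summary: str

@dataclass
class WorkerAgent:
    """A short-lived agent: one endpoint, one technique, then it is retired."""
    endpoint: str
    technique: str

    def run(self) -> list:
        # In a real system this would drive an LLM against the target;
        # here it simply returns no findings.
        return []

@dataclass
class CoordinationAgent:
    """Maintains the global view so agent 3,000 knows where agent 400 went."""
    surface: set                                    # endpoints discovered so far
    tested: dict = field(default_factory=dict)      # endpoint -> techniques tried
    findings: list = field(default_factory=list)

    def next_task(self):
        # Deterministic prioritization: the first untried endpoint/technique pair.
        for endpoint in sorted(self.surface):
            for technique in ("sqli", "idor", "ssrf"):
                if technique not in self.tested.get(endpoint, set()):
                    return endpoint, technique
        return None  # every pair has been attempted at least once

    def debrief(self, agent: WorkerAgent, results: list) -> None:
        # Record coverage so no later agent repeats this exact work.
        self.tested.setdefault(agent.endpoint, set()).add(agent.technique)
        self.findings.extend(results)

coordinator = CoordinationAgent(surface={"/api/users", "/api/orders"})
while (task := coordinator.next_task()) is not None:
    worker = WorkerAgent(*task)                     # spawn, run, discard
    coordinator.debrief(worker, worker.run())
```

The design point is that the worker is disposable while the coordinator’s coverage map is durable: state lives in the orchestration layer, not in any single LLM context.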
Validation in AI pentesting
In addition, LLMs are trained to please, so their findings are not always reliable and need to be validated. Telling an LLM to find an exploit is a bit like telling a dog to fetch the ball. Like an LLM, the dog wants to please. If it can’t find a ball, it will be determined to fetch and bring you anything: a stick, a clump of leaves. That stick or clump of leaves? That’s the LLM hallucinating and returning false positives.
As with dog training, validation is key in AI pentesting, which is why XBOW employs validator agents that confirm whether a discovered issue is truly exploitable using controlled, production-safe challenges. And this validation is ongoing and frequent; validating only the end result leaves too much room for error and hallucination. For instance, instead of tasking an agent to find an IDOR vulnerability in a website and then validating whatever it returns, XBOW would task agents to first find an endpoint of the website, then verify it exists. Then find an object reference within that endpoint, then verify it can actually be accessed. Then find an object reference that cannot be accessed when logged out, then verify that. And so on.
By breaking the work into small, verifiable steps, you improve the chances that the agent finds real results.
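As a rough illustration of that breakdown, the sketch below encodes the first three verifiable steps as explicit checks. The target URL and the verify_* helpers are hypothetical, and a production system would use controlled, production-safe challenges rather than raw requests.

```python
import requests

BASE = "https://target.example.com"   # hypothetical in-scope target

def verify_endpoint_exists(path: str) -> bool:
    # Step 1: the endpoint the agent claims to have found must actually respond.
    return requests.get(f"{BASE}{path}", timeout=10).status_code < 500

def verify_object_accessible(path: str, obj_id: str,
                             session: requests.Session) -> bool:
    # Step 2: the object reference must be reachable while logged in.
    return session.get(f"{BASE}{path}/{obj_id}", timeout=10).status_code == 200

def verify_blocked_when_logged_out(path: str, obj_id: str) -> bool:
    # Step 3: the same object must be denied to an unauthenticated request.
    return requests.get(f"{BASE}{path}/{obj_id}", timeout=10).status_code in (401, 403)

def idor_preconditions_hold(path: str, obj_id: str,
                            session: requests.Session) -> bool:
    # Each claim is verified before the next task is issued, so a hallucinated
    # endpoint or object reference is caught immediately rather than at the end.
    return (verify_endpoint_exists(path)
            and verify_object_accessible(path, obj_id, session)
            and verify_blocked_when_logged_out(path, obj_id))

# The next verified step (omitted here) would attempt the same reference from
# a different user's session: the actual IDOR check.
```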
Reliability and repeatability in AI pentesting
AI is also nondeterministic; ask it the same thing twice, and it will not necessarily return the same answer. How do we ensure the results are reliable and repeatable? Basically with more “eyes.” We never consider a section of the attack surface “covered” until several agents have looked at it. The number of agents needed to be certain can vary depending on the site and the vulnerability. For instance, IDORs are often more challenging to find reliably, while a more concrete vulnerability, like detecting file reads, is easier. But after studying the variability of different agents over time, we have found that it typically takes four agents to deliver a reliable result. That’s why we don’t report a part of the attack surface “covered” until at least four agents have assessed it.
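A simplified sketch of that coverage rule, with hypothetical names and the four-agent threshold from the research described above:

```python
from collections import defaultdict

REQUIRED_AGENTS = 4   # the threshold our variability research converged on

# surface section -> ids of the agents that have assessed it
assessments = defaultdict(set)

def record_assessment(section: str, agent_id: str) -> None:
    assessments[section].add(agent_id)

def is_covered(section: str) -> bool:
    # Multiple independent looks smooth out nondeterministic misses:
    # one agent overlooking a subtle IDOR does not mark the section done.
    return len(assessments[section]) >= REQUIRED_AGENTS

record_assessment("/api/users", "agent-17")
record_assessment("/api/users", "agent-42")
print(is_covered("/api/users"))   # False: only two of four agents so far
```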
Safety guardrails for AI pentesting
Safety is another area where unconstrained AI falls short. AI agents will be determined to accomplish their tasks, even if that takes them into dangerous, damaging territory. How do you keep them in line? At XBOW, we carefully manage the agents’ motive, methods, and opportunities (a sketch of the third follows this list):
- Motive: We specify the goals in a way that harmful actions are not valid, acceptable solutions. We don’t give the agent a reason to take unsafe actions.
- Method: We give agents only the tools they need for their specific task, nothing more. A coding agent that does not need to delete files will not have file deletion in its tool set.
- Opportunity: We ensure that every agent action is checked before execution. For example, if an agent needs command line access, we check that access at the moment of use rather than granting it permanently. Real safety means just-in-time checks while the agent is running.
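Here is a minimal sketch of how the “opportunity” control might look in code. The ALLOWED_COMMANDS set, PermissionDenied, and run_tool are hypothetical illustrations, not XBOW’s API.

```python
import shlex
import subprocess

# Method: this agent only ever gets the tools its task requires.
ALLOWED_COMMANDS = {"nmap", "curl"}

class PermissionDenied(Exception):
    pass

def check_at_time_of_use(command: str) -> None:
    # Opportunity: a just-in-time check while the agent is running,
    # not a permanent grant issued when the agent was created.
    binary = shlex.split(command)[0]
    if binary not in ALLOWED_COMMANDS:
        raise PermissionDenied(f"{binary} is not in this agent's tool set")

def run_tool(command: str) -> str:
    check_at_time_of_use(command)          # checked on every single call
    result = subprocess.run(shlex.split(command),
                            capture_output=True, text=True, timeout=60)
    return result.stdout

try:
    run_tool("rm -rf /tmp/scratch")        # blocked before anything executes
except PermissionDenied as err:
    print(err)
```

Because the check runs on every call, revoking a tool takes effect immediately; there are no standing grants to clean up.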
Structure and governance matter in AI pentesting
In pentesting, it’s easy to make plausible mistakes based on small misinterpretations, for humans and AI alike. But while a human will move on from a mistake, an AI agent will build upon it. This is why you can’t just wrap an LLM and call it a pentest: effective AI pentesting needs a structured, governed system. LLM agents on their own behave unpredictably, hallucinate exploits, and lack any ground-truth validation. In a governed system, every step is validated, taken within safety guardrails, and yields predictable, reliable results. To quote the XBOW CRO again, “LLM-powered pentesting is a feature. Coordinated, validated, auditable autonomous testing is a platform.”
Learn more about how XBOW validates AI findings to ensure accuracy in our new whitepaper.
