XBOW Unleashes GPT-5’s Hidden Hacking Power, Doubling Performance
OpenAI's initial assessment of GPT-5 showed modest cyber capabilities, yet integrating it into the XBOW platform unleashed its hidden hacking power and doubled our agent's performance
At the launch of GPT-5, OpenAI announced that it offered cybersecurity capabilities comparable to its predecessors. But findings at XBOW reveal a dramatically different reality. While the model performs as expected in isolation, integrating it into the XBOW autonomous penetration testing platform unlocked a significant leap in performance. The agent now executes penetration tests faster and more consistently, and finds vastly more exploits. This superior performance is evident in both controlled benchmarks and real-world engagements, where we have observed performance gains of more than a factor of two.
OpenAI's Initial Assessment
Following the release of GPT-5, OpenAI published a detailed system card. For most general applications, the model was presented as a moderate advancement. However, regarding its offensive cybersecurity capabilities, the assessment was conservative. The system card states, “the gpt-5 model series does not meet the threshold for high cyber risk,” and that on Capture the Flag (CTF) benchmarks, “gpt-5-thinking performs comparably to OpenAI o3.”
In fact, OpenAI's internal testing for CTFs showed gpt-5-thinking and gpt-5-thinking-mini performing slightly below their predecessor, OpenAI o3 (Figure 14, OpenAI GPT-5 System Card).

Furthermore, in more complex, end-to-end cyber operation simulations, OpenAI reported that “gpt-5-thinking performs similarly to OpenAI o3: it is unable to solve any of the cyber range scenarios unaided,” even on easy-level challenges (Figure 15, OpenAI GPT-5 System Card).
We are not disputing OpenAI’s own assessment of the cyber capabilities of GPT-5. But what’s easy to overlook is that the “aid” that turns a raw model into a brilliant pentester does not, in this case, have to be a human guiding its hand. It can, in fact, simply be the integration into a sufficiently powerful platform.

XBOW's Breakthrough Findings
Our internal testing revealed that GPT-5 is substantially more capable than any previous model we have integrated into our system. The key was not the model in isolation, but employing it as the exploit-crafting engine within XBOW's autonomous agent framework.
Our data shows that an XBOW agent integrated with GPT-5 is far more effective at discovering vulnerabilities in live production targets than any previous model. In a head-to-head comparison, the GPT-5 agent identified 70% of the vulnerabilities found by our previous setup (a Sonnet/Gemini alloy) in a single run. Conversely, the previous engine identified only 23% of the vulnerabilities found by the new GPT-5 agent in a single run (during actual pentests, we give the agent several goes at scrutinizing the target from different angles). Delving into the data reveals that this higher success rate rests on two things:
- Using GPT-5, our agent finds more elusive vulnerabilities.
- Using GPT-5, our agent finds those vulnerabilities more consistently.
This trend holds true across various vulnerability classes, including file access vulnerabilities, Server-Side Request Forgery (SSRF), and Cross-Site Scripting (XSS).


The GPT-5-powered agent is also more efficient, going straight for the kill instead of meandering down blind alleys. We measured the number of iterations an agent requires to craft an exploit: the median for the agent using GPT-5 was 17 iterations, a significant improvement over the 24 required with Sonnet/Gemini, our best-performing engine to date. This indicates that the new model arrives at a successful exploit path more directly.

Furthermore, the quality of the findings has improved. The GPT-5 agent found more elaborate exploits, and in many cases avoided false positives. For instance, in tests for file read vulnerabilities, the previous agent had a false positive rate of 18%, while the GPT-5 agent did not produce a single false positive (at almost double the number of findings).
We don’t want to suggest it’s all sunshine: for XSS vulnerabilities, GPT-5 occasionally produced cases that our exploit validator incorrectly marked as valid. These are points where we need to update our own system to keep up with the model’s creativity. The security team at XBOW continuously makes such improvements to the validators that verify each exploit before it is reported.
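To illustrate the kind of check such a validator performs, here is a minimal sketch of a file-read validation step in Python. The function, the target file, and the reflection heuristic are illustrative assumptions on our part, not XBOW's actual validator code:

```python
import re

import requests  # assumed HTTP client, any would do


def validate_file_read(exploit_url: str) -> bool:
    """Illustrative check for a claimed /etc/passwd read: accept only if
    the response contains the file's characteristic structure, rather than
    merely echoing the attacker-supplied path back (a classic source of
    false positives)."""
    body = requests.get(exploit_url, timeout=10).text

    # A line like "root:x:0:0:" is strong evidence of real file contents.
    passwd_line = re.compile(r"^[A-Za-z_][\w-]*:[^:]*:\d+:\d+:", re.M)

    # Reflection alone proves nothing: seeing "/etc/passwd" in the body
    # without any passwd-format lines means the app just echoed our input.
    if "/etc/passwd" in body and not passwd_line.search(body):
        return False

    return bool(passwd_line.search(body))
```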

During last week’s Black Hat conference, we demonstrated running XBOW live at our booth on production targets from HackerOne. So this week, we decided to repeat the same run with the same XBOW agent, just integrated with GPT-5 instead of previous models. We had seen (and hardly believed) initial positive results on internal benchmarks, where the 55% pass rate of a Sonnet/Gemini alloy jumped to 79% using GPT-5.
But we were still not prepared for the sea change we saw in the number of successful exploits found in the wild: using GPT-5, the agent hacked nearly twice as many unique targets in the same time (which, to be clear, was very limited compared to a full pentest).

Performance is More Than Just the Model
The stark contrast between OpenAI's benchmarks and XBOW's results raises a critical question: How can a model that performs modestly on its own excel so dramatically within a specialized system? The answer is that an AI agent's performance is determined by far more than its underlying model: the XBOW platform gives the model the tools and the support to succeed.
On a micro-level, our agents are equipped with specialized tools. Human pentesters use browsers and Burp Suite, but these are hard for an LLM to drive. So our security, engineering, and AI experts came together to design a toolkit of specialized applications that are LLM-friendly in input and output and provide the maximum amount of value to the agent.
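As a sketch of what "LLM-friendly" can mean in practice, consider a wrapper that collapses a raw HTTP exchange into one compact, structured observation the model can parse reliably. The function name and summary format here are hypothetical, chosen purely for illustration:

```python
import requests  # assumed HTTP client

INTERESTING_HEADERS = {"server", "location", "set-cookie", "content-type"}


def http_probe(url: str, method: str = "GET", body: str | None = None) -> str:
    """Hypothetical LLM-friendly tool: one call in, one short plain-text
    summary out, instead of the raw traffic a browser or proxy produces."""
    resp = requests.request(method, url, data=body, timeout=10,
                            allow_redirects=False)
    headers = {k: v for k, v in resp.headers.items()
               if k.lower() in INTERESTING_HEADERS}
    # Truncate the body so the observation fits in the agent's context.
    return (f"{method} {url} -> {resp.status_code}\n"
            f"headers: {headers}\n"
            f"body (first 800 chars): {resp.text[:800]}")
```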
On a meso-level, our agents are not alone: they work as a team, with agents specialized in different vulnerability classes running on highly scalable systems, scrutinizing every target in depth and repeatedly from different angles. This allows fast assessments that nevertheless avoid false negatives.
And finally, on a macro-level, our agents are not unguided: a central coordinator fulfills the role of an experienced manager of the pentesting team, combining both our AI and our security expertise. It directs discovery, prioritizes leads and tasks, keeps track of our knowledge of the target, and makes sure the target is scrutinized systematically.
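A rough sketch of how such a coordinator might look, with every class and method name invented for the example: it owns a prioritized queue of leads and the shared picture of the target, and specialist agents feed new leads back into it.

```python
from dataclasses import dataclass, field


@dataclass
class Lead:
    endpoint: str      # e.g. "/api/fetch?url="
    vuln_class: str    # e.g. "ssrf", "xss", "file-read"
    priority: int      # higher means: look at this first


@dataclass
class Coordinator:
    """Sketch of the 'pentest manager' role: prioritize leads, dispatch
    them to specialist agents, and fold results back into shared state."""
    knowledge: dict = field(default_factory=dict)
    leads: list[Lead] = field(default_factory=list)

    def add_lead(self, lead: Lead) -> None:
        self.leads.append(lead)
        self.leads.sort(key=lambda l: l.priority, reverse=True)

    def run(self, specialists: dict) -> None:
        # `specialists` maps a vulnerability class to an agent object
        # exposing investigate(lead, knowledge) -> list[Lead].
        while self.leads:
            lead = self.leads.pop(0)
            agent = specialists.get(lead.vuln_class)
            if agent is None:
                continue  # no specialist for this class yet
            for follow_up in agent.investigate(lead, self.knowledge):
                self.add_lead(follow_up)  # findings spawn new leads
```

The point of the pattern is that findings are never dead ends: each one updates the shared knowledge of the target and can spawn fresh, reprioritized leads for other specialists.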
In their system card, OpenAI acknowledges the possibility of such a platform, noting that “these evaluation results likely represent lower bounds on model capability, because additional scaffolding or improved capability elicitation could substantially increase observed performance.” Our work at XBOW is a practical demonstration of this principle. By building a sophisticated system around GPT-5, we unlocked its dormant, high-end cybersecurity capabilities that were not apparent in isolated tests.
So what changed?
It’s rather early to say exactly what the special sauce is that makes GPT-5 so much more adept at integrating with the XBOW platform than its predecessors, including OpenAI’s previous models. The unavoidable conclusion is that it brings a much higher general expertise in cybersecurity, one that OpenAI’s internal testing just didn’t quite bring to the fore.
There are two more important threads, though: reasoning, and ambitious command sequences.
Reasoning models, when first introduced, were incredibly powerful in some domains, neutral in others. For integration with the XBOW platform, the early reasoning models were actively worse. That’s because they were geared towards problems that could be solved with, well, reasoning. Find a mathematical proof! Solve a logic puzzle! But finding a vulnerability is a different kind of task: it’s a needle-in-a-haystack search where you can plan very little and need to go where your exploration takes you. We once saw an agent integrating DeepSeek R1 thinking for many long minutes before it had even sent a single curl request to the website, convincing itself that there _is_ a vulnerability, it’s most likely in this-and-that feature, and it might maybe be triggered like so. In the end, it crafted an elaborate exploit script that didn’t get a single endpoint right.
Newer reasoning models were better. They understood, to some extent, the value of exploration, and instead of trying to solve the problem immediately, they were more open to the idea of gathering the information needed to solve it later. While we didn’t get much joy out of o3, Sonnet-4’s performance increased ever so slightly when we turned on its reasoning mode. Then Grok-4 was the first that appeared to use reasoning to genuinely good effect in our systems, and now GPT-5 comes full circle: it combines gathering information with anticipating possible outcomes.
To that end, we’ve seen it issue long, elaborate sequences of shell commands in a single iteration, combining exploration with, depending on the outcome, the next step of exploitation. This is in stark contrast to most other models, which are typically less ambitious when using the terminal – and for good reason: GPT-5’s ability to write long but correct shell scripts is rather unprecedented, and probably reflects a training procedure that paid more attention to teaching this very important skill.
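To give a flavor of what such a single-iteration command sequence can look like, here is a toy reconstruction rather than a transcript of GPT-5's actual output; the target, endpoint, and SSRF hypothesis are all invented. The script probes first and branches into the next exploitation step only if the probe pays off:

```python
import subprocess

# Invented example of exploration plus conditional exploitation in one go:
# probe whether an internal address is reachable through a fetch parameter,
# and only if so, go after the cloud metadata service in the same breath.
script = r"""
code=$(curl -s -o /tmp/resp -w '%{http_code}' \
  'http://target.example/api/fetch?url=http://127.0.0.1/')
if [ "$code" = 200 ] && [ -s /tmp/resp ]; then
  echo '[+] fetch parameter reaches localhost; trying metadata endpoint'
  curl -s 'http://target.example/api/fetch?url=http://169.254.169.254/latest/meta-data/'
else
  echo "[-] no SSRF via ?url= (status $code)"
fi
"""

result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
print(result.stdout)
```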
The Accelerating Trend in Offensive AI
The pace of improvement in AI-driven offensive security is accelerating dramatically. The capabilities unlocked by integrating newer models into robust agentic systems are delivering step-function increases in performance. Our own internal benchmarks show a steep upward trajectory in success rates, with the integration of GPT-5 marking the most significant leap to date.

The collaboration between advanced AI models from pioneers like OpenAI and specialized autonomous systems from companies like XBOW represents the future of cybersecurity. It is through this combination that we can build more effective, predictable, and scalable security solutions to defend organizations against increasingly sophisticated threats.