Mythos and GPT-5.5: What AI Vulnerability Detection Misses

Frontier AI models like Mythos and GPT-5.5 can uncover real vulnerabilities, but enterprise-ready offensive security requires much more than finding bugs: coverage, validation, safety, governance, and operational integration.

If you point a frontier LLM at a web application and tell it to find vulnerabilities, it will probably find something. XBOW had early access to both Mythos and GPT-5.5, and our testing results clearly illustrate the power of these models to unearth vulnerabilities in source code.

That might be enough for an attacker, who only needs to find one way in. A defender has a different job: understand the full attack surface, identify as many viable paths as possible, validate what is real, and do it safely enough that the testing itself does not create a new incident.

Using an LLM to find a vulnerability is simple. Turning that behavior into a reliable, safe, repeatable system that an enterprise can trust is complex.

The models are powerful, and the tooling ecosystem is moving quickly. But if you are considering building an offensive security solution, there are several questions worth asking early.

The most important ones are about coverage, safety, validation, model strategy, and enterprise readiness.

Are you optimizing for finding a bug, or for confidence in the coverage?

Pentesting is the gold standard of security testing because of trust. You know the human pentester will use their skills, logic, and experience to investigate the attack surface, pivoting to new attack paths and methods when thwarted. This type of test gives you the peace of mind that your system has been thoroughly explored and tested.

An LLM won’t give you similar confidence that everything there is to find has been found.

Why? LLMs are not naturally persistent. They are trained to produce helpful-looking continuations and are tuned to avoid wasting effort. In practice, this means that they give up easily. They are very good at making progress on a specific thread of investigation, but they can be too quickly satisfied by their own work. Once they have found one promising result, they may stop searching, underexplore adjacent surfaces, or fail to return to earlier assumptions.

A human pentester keeps pushing when the obvious paths are exhausted. Any AI system needs some equivalent of that discipline. Otherwise, it can give a false sense of security: it found something real, but it did not tell you what it missed.

Questions to ask:

How does the system know what the attack surface is?
How does it decide which areas deserve deeper investigation?
How does it avoid repeatedly testing the same surface while ignoring others?
How does it know when a part of the application has been sufficiently covered?
How does it handle vulnerability classes that require multi-step reasoning across authenticated states, roles, workflows, or APIs?

The scale problem

At scale, this becomes an orchestration problem. A single long-running agent will accumulate assumptions, get distracted, overweight earlier observations, and eventually become less effective. A fleet of agents can help, but fleets create their own problems: overlap, duplication, contradiction, and wasted effort, not to mention the LLM spend that redundancy burns.

XBOW’s approach is to orchestrate many short-lived, specialized agents under coordinator agents that track the attack surface, assign priorities, and decide how much effort to spend on different areas.
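To make the coordinator idea concrete, here is a minimal Python sketch of how a coordinator might track the attack surface and spread effort across it. It is an illustration under stated assumptions, not a description of XBOW's implementation: `SurfaceItem`, `Coordinator`, the `risk` score, and the `max_attempts` budget are all hypothetical names and parameters.

```python
from dataclasses import dataclass


@dataclass
class SurfaceItem:
    """One unit of attack surface, e.g. an endpoint plus a role (hypothetical model)."""
    name: str
    risk: float          # assumed prior estimate of how promising the area is
    attempts: int = 0    # how many short-lived agents have been sent here


class Coordinator:
    """Tracks coverage and hands out the next most under-tested area."""

    def __init__(self, items, max_attempts=3):
        self.items = {item.name: item for item in items}
        self.max_attempts = max_attempts

    def priority(self, item):
        # Higher risk raises priority; every prior attempt lowers it,
        # so effort spreads out instead of piling onto one surface.
        return item.risk / (1 + item.attempts)

    def next_assignment(self):
        open_items = [i for i in self.items.values() if i.attempts < self.max_attempts]
        if not open_items:
            return None  # every area has used up its effort budget
        chosen = max(open_items, key=self.priority)
        chosen.attempts += 1
        return chosen.name

    def coverage(self):
        # Fraction of known surface that at least one agent has tested.
        touched = sum(1 for i in self.items.values() if i.attempts > 0)
        return touched / len(self.items)
```

Even this toy version answers two of the coverage questions directly: it knows which areas have been ignored, and it has an explicit rule for when an area has received enough effort. A real system would also need to discover new surface mid-test and revise risk estimates as agents report back.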

Can you validate findings?

LLMs are persuasive and designed to please. That is useful when they write reports and dangerous when they are wrong. In their eagerness to please, they also might stop and return an answer before doing a full investigation, another dangerous tendency in pentesting exercises.

A finding that sounds plausible but cannot be reproduced is only a hypothesis. An enterprise-ready system needs validation outside the model’s narration.

Questions to ask:

What evidence is required before a finding is reported?
Can the exploit be reproduced deterministically?
Are intermediate claims checked, or only the final result?
Does validation rely on the same model that proposed the finding?
Can the system distinguish between interesting behavior, likely vulnerability, and confirmed exploit?

XBOW employs validator agents that confirm whether a discovered issue is truly exploitable using controlled, production-safe challenges. Most of these checks are deterministic, which keeps hallucinations out of the verdict. Others, such as validators for complex business logic vulnerabilities, check the finding against a generated threat model instead.
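One way to picture validation outside the model's narration is a finding record whose status depends only on independent reproductions. The sketch below is a hypothetical illustration in Python; the `Finding` class, the three status labels, and the `required_runs` threshold are assumptions made for the example, not XBOW's validator design.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    INTERESTING = "interesting behavior"
    LIKELY = "likely vulnerability"
    CONFIRMED = "confirmed exploit"


@dataclass
class Finding:
    title: str
    proposed_by: str                               # agent or model that raised it
    evidence: list = field(default_factory=list)   # (validator, reproduced) pairs

    def record(self, validator: str, reproduced: bool) -> None:
        self.evidence.append((validator, reproduced))

    def status(self, required_runs: int = 3) -> Status:
        # Only count checks run by something other than the proposer,
        # so a model cannot confirm its own claim.
        independent = [ok for who, ok in self.evidence if who != self.proposed_by]
        if len(independent) >= required_runs and all(independent):
            return Status.CONFIRMED
        if any(independent):
            return Status.LIKELY
        return Status.INTERESTING
```

Under this scheme a claim that only its author has "verified" never rises above interesting behavior, and a single failed reproduction keeps it from being reported as confirmed.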

Can the system test aggressively without causing harm?

AI agents are determined to accomplish their tasks, even if that takes them into dangerous, damaging territory.

It’s critical to prevent any AI-driven offensive security solution from harming its target.

Questions to ask:

What actions are agents explicitly forbidden from taking?
Are tools granted to agents broadly, or only just in time?
Is there an independent safety check before execution?
Does the system monitor target health during the test?
Can it prove exploitability without accessing or changing sensitive data?

The XBOW platform has several layers of guardrails in place:

Careful commands: If looking for SQLi, for example, XBOW will test for it by running a sleep command on the database, not by downloading data or making changes to it.

Guardians: A guardian model judges what the pentester is doing at every step, questioning whether it could possibly be unsafe.

Health checks: The XBOW platform continually observes the health of the target system, watching for signs of stress such as slower responses. At the first sign of distress, it backs off.
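Two of these guardrails come down to simple decisions over measured response times, which a short Python sketch can illustrate. The first function decides whether a harmless injected delay actually showed up in the responses. The second backs off when the target slows down. Everything here is an assumption for illustration (the function and class names, the tolerance, the slowdown factor, the doubling pause); it contains no payloads or network calls and is not XBOW's implementation.

```python
from statistics import median


def delay_confirmed(baseline_s, probe_s, injected_delay_s, tolerance_s=0.5):
    """True if every probe response is slower than the typical baseline by
    roughly the injected delay. All inputs are response times in seconds."""
    typical = median(baseline_s)
    threshold = typical + injected_delay_s - tolerance_s
    # Require every probe to show the delay, so one slow response
    # caused by network jitter is not mistaken for proof.
    return len(probe_s) > 0 and all(t >= threshold for t in probe_s)


class HealthMonitor:
    """Backs off when the target slows down relative to its starting latency."""

    def __init__(self, baseline_latency_s, slowdown_factor=2.0):
        self.baseline = baseline_latency_s
        self.slowdown_factor = slowdown_factor
        self.pause_s = 0.0

    def observe(self, latency_s, ok=True):
        """Record one health probe and return how long to pause, in seconds."""
        stressed = (not ok) or latency_s > self.baseline * self.slowdown_factor
        if stressed:
            # Double the pause (starting at 1 s) on each sign of distress.
            self.pause_s = max(1.0, self.pause_s * 2)
        else:
            # Recover gradually once the target looks healthy again.
            self.pause_s = self.pause_s / 2 if self.pause_s > 1.0 else 0.0
        return self.pause_s
```

The point of the timing check is that exploitability is shown by a delay alone, without reading or modifying any data. The monitor makes "backs off at the first sign of distress" an explicit, testable rule rather than something left to an agent's judgment.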

How is your data protected?

Security testing generates some of the most sensitive data an organization holds. How findings are handled truly makes the difference between a prototype and something an enterprise can rely on.

Questions to ask:

Is data sent to third-party model providers?
Is any data retained by those providers?
Can the system run self-hosted or single-tenant?
Can customers bring their own keys or models?
Are logs and traces stored securely?
Can sensitive evidence be redacted without losing reproducibility?
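On the last question, one possible approach is to replace sensitive values with keyed, stable placeholders so the evidence still shows which requests shared a token or account. The Python sketch below assumes just two patterns (bearer tokens and email addresses) and a hypothetical `redact` helper. A real deployment would need a much broader set of detectors and a protected place to keep the key.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; real evidence contains many more secret formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"(?<=Bearer )[A-Za-z0-9._~+/-]+=*")


def redact(text: str, key: bytes) -> str:
    """Replace sensitive values with keyed, stable placeholders.

    The same value always maps to the same placeholder under a given key,
    so a reader can still see that two requests reused one token without
    seeing the token itself.
    """
    def placeholder(kind):
        def swap(match):
            digest = hmac.new(key, match.group(0).encode(), hashlib.sha256).hexdigest()
            return f"<{kind}:{digest[:8]}>"
        return swap

    text = BEARER.sub(placeholder("token"), text)
    return EMAIL.sub(placeholder("email"), text)
```

Because the placeholders are derived with an HMAC rather than a plain hash, someone without the key cannot confirm a guessed value by hashing it themselves.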

Can it fit into how your organization works?

Finding vulnerabilities is only the first step. Someone has to triage them, assign them, reproduce them, fix them, verify the fix, and measure whether risk is going down.

Enterprise security programs need findings to move through existing workflows: ticketing, vulnerability management, SIEMs, CI/CD systems, developer tools, evidence stores, and compliance processes.

Questions to ask:

Can findings be routed to the right teams automatically?
Can it create useful tickets with evidence and reproduction steps?
Can it deduplicate findings across repeated tests?
Can it retest fixes?
Can it handle authentication, roles, sessions, and realistic workflows?
Can it produce audit trails that security, engineering, and compliance teams trust?
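Deduplication across repeated tests usually comes down to giving each finding a stable identity. As a rough sketch, assuming a finding can be described by a vulnerability class, HTTP method, URL, and parameter (a hypothetical schema chosen for this example), a fingerprint might look like this in Python:

```python
import hashlib
from urllib.parse import urlsplit


def fingerprint(vuln_class: str, method: str, url: str, parameter: str) -> str:
    """Stable ID for a finding, ignoring details that change between runs."""
    parts = urlsplit(url)
    # Drop the query string and normalize case so the same issue found
    # with a different payload or host casing hashes identically.
    canonical = "|".join([
        vuln_class.lower(),
        method.upper(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parameter,
    ])
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def merge_runs(known: dict, new_findings: list) -> list:
    """Return only findings not seen before; update `known` in place."""
    fresh = []
    for f in new_findings:
        fid = fingerprint(f["class"], f["method"], f["url"], f["parameter"])
        if fid not in known:
            known[fid] = f
            fresh.append(f)
    return fresh
```

The same fingerprint can serve as the key for ticket updates and retests: a finding that disappears after a fix and reappears later maps back to the original ticket instead of opening a new one.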

Do you have the staff and the budget?

There will be both staffing and token costs associated with an AI pentesting solution. Who will own the solution and update it when models change? In addition, model costs will be high for this type of solution, and even when the cost falls, inefficient agent behavior can create unnecessary spend. How will you make sure model token usage is efficient?
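A back-of-the-envelope estimate shows how quickly redundant agent work adds up. The Python sketch below uses entirely hypothetical figures (agent count, tokens per agent, per-million-token prices, and duplicate rate are made up for illustration, not drawn from any provider's price list).

```python
def run_cost(agents, input_tokens, output_tokens,
             price_in_per_m, price_out_per_m, duplicate_rate=0.0):
    """Estimate spend for one test run and how much of it is duplicated work.

    Token counts are per agent; prices are per million tokens.
    """
    per_agent = (input_tokens * price_in_per_m
                 + output_tokens * price_out_per_m) / 1_000_000
    total = agents * per_agent
    wasted = total * duplicate_rate
    return {"total": round(total, 2), "wasted": round(wasted, 2)}


# Hypothetical run: 200 agents, 400k input and 20k output tokens each,
# priced at 3.00 and 15.00 per million tokens, with a quarter of the work duplicated.
estimate = run_cost(200, 400_000, 20_000, 3.0, 15.0, duplicate_rate=0.25)
```

With these assumed numbers, each agent costs 1.50, the run costs 300.00, and 75.00 of that pays for work another agent already did. Falling prices shrink the total, but the wasted share stays the same unless the orchestration improves.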

More than a model

AI models excel at many offensive security tasks: crafting payloads, recognizing patterns, reading messy responses, summarizing evidence, and adapting when a first attempt fails. But they need structure around the places where they are weak: planning, coverage, safety, validation, repeatability, and enterprise integration.

That is the difference between a promising prototype and an offensive security system that an organization can rely on.

Get more details on how XBOW turns frontier model capability into governed, validated offensive-security execution in our new whitepaper.
