Why LLMs Hallucinate Vulnerabilities
LLMs are good at spotting patterns and proposing plausible vulnerabilities, but their confidence is not proof. In this post, I explain why raw LLM output can’t be trusted as a finding, and why validation must exist outside the model to separate real vulnerabilities from convincing noise.
Large Language Models (LLMs) are increasingly used in security tools to analyze applications, suggest exploits, and identify potential vulnerabilities. In many cases, they surface useful leads faster than traditional approaches.
But teams trying to operationalize LLM-driven security quickly hit a familiar problem: The model sounds confident, but the vulnerability isn’t real.
This isn’t a flaw in a specific vendor’s model, nor something solved with better prompts or more context. It’s a fundamental mismatch between how LLMs reason and how vulnerabilities actually exist in the real world.
To understand why, it helps to separate thinking about vulnerabilities from proving they exist.
What “hallucination” means in security
In AI, hallucination refers to generating information that isn’t grounded in reality. In security, it shows up in subtler (and more expensive) ways. For example:
- Inferring SQL injection from a response pattern that only resembles prior exploits
- Suggesting an endpoint is exploitable because it matches a known vulnerability class
- Asserting impact without ever demonstrating it
These outputs aren’t random. Instead, they’re the result of pattern recognition across vulnerability data, exploit writeups, and source code. The model is doing exactly what it’s designed to do: generate the most plausible explanation based on prior examples.
The problem is simple: plausibility is not proof. In security, a vulnerability only exists if it can be exercised against a real system. Everything else is a hypothesis.
Raw LLM output can’t be trusted
LLMs don’t observe reality. They don’t measure timing differences, verify side effects, or confirm that a payload actually changed application behavior. They reason abstractly about what should happen, what usually happens, and what has happened elsewhere, and they interpret what might have happened here. But they never confirm what did happen.
This is why treating raw LLM output as a finding creates noise. A model can correctly identify an interesting attack surface and still be wrong about exploitability.
This limitation isn’t fixable by making the model smarter. Even a perfect reasoning engine still lacks direct access to ground truth, which is why systems that rely on LLMs alone tend to over-report and push verification back onto humans.
A concrete example: when a confident hypothesis isn’t enough
In testing an open-source ActiveSync server (Z-Push), an AI agent flagged a suspected SQL injection after noticing unusual authentication behavior. The input path and response patterns closely resembled known injection techniques.
Rather than accepting that inference at face value, the system (not the LLM!) attempted to prove it using controlled timing-based requests and comparisons against known-safe inputs. The issue was only reported once the timing behavior proved consistent and reproducible. Without validation, the signal would have been indistinguishable from a false positive.
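To make that concrete, here is a minimal sketch of the kind of timing-based check a validation phase might run. The endpoint, parameter names, and payload are hypothetical placeholders rather than the actual Z-Push test, and a production harness would control for network jitter far more carefully.

```python
import statistics
import requests

TARGET = "https://example.test/Microsoft-Server-ActiveSync"  # hypothetical endpoint
SAFE_VALUE = "user@example.test"                             # known-safe baseline input
DELAY_PAYLOAD = "user@example.test' AND SLEEP(5)-- -"        # time-based probe
ATTEMPTS = 5
THRESHOLD = 4.0  # seconds; the injected delay minus an allowance for jitter

def median_response_time(value: str) -> float:
    """Send the same request several times and return the median response time."""
    timings = []
    for _ in range(ATTEMPTS):
        resp = requests.post(TARGET, data={"username": value, "password": "x"}, timeout=30)
        timings.append(resp.elapsed.total_seconds())
    return statistics.median(timings)

baseline = median_response_time(SAFE_VALUE)
delayed = median_response_time(DELAY_PAYLOAD)

# Only treat the hypothesis as confirmed if the delay is large and consistent
# relative to the known-safe baseline.
confirmed = (delayed - baseline) > THRESHOLD
print(f"baseline={baseline:.2f}s  delayed={delayed:.2f}s  confirmed={confirmed}")
```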
Hypotheses are useful, but conclusions are dangerous
Used correctly, LLMs are extremely valuable in security workflows. They’re good at:
- Generating creative exploit hypotheses
- Recognizing subtle vulnerability patterns
- Connecting behavior to historical flaws
- Exploring unusual inputs and edge cases
This mirrors how human pentesters work. They start with suspicion, not proof. The difference is what happens next.
A hypothesis is a starting point. A conclusion requires evidence. When systems collapse those two steps, they blur the line between exploration and verification—and that’s where trust breaks down.
Validation must exist outside the model
Because LLMs can’t observe reality directly, validation can’t live inside the model.
When an AI-driven system suspects a vulnerability, that suspicion should trigger a second phase that attempts to prove or disprove the idea using deterministic checks, such as the following (one is sketched after the list):
- Measuring timing differences to confirm blind injection
- Verifying whether a server makes an outbound request
- Checking for access to data that should be unreachable
- Observing real browser behavior for client-side impact
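As an illustration of the second check, here is a minimal out-of-band sketch: hand the target a URL carrying a unique token, and treat the suspicion as real only if the target actually calls back. The listener address, target URL, and parameter name are assumptions for illustration; a real setup would use a callback host the target can reach.

```python
import threading
import time
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

CALLBACK_HOST, CALLBACK_PORT = "127.0.0.1", 8808  # assumed reachable by the target
TOKEN = uuid.uuid4().hex                          # unique marker for this single test
hits: list[str] = []

class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if TOKEN in self.path:        # record only requests that carry our token
            hits.append(self.path)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):     # silence default per-request logging
        pass

server = HTTPServer((CALLBACK_HOST, CALLBACK_PORT), CallbackHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Hand the target a URL that points back at our listener via the suspected parameter.
callback_url = f"http://{CALLBACK_HOST}:{CALLBACK_PORT}/{TOKEN}"
requests.post("https://example.test/fetch", data={"url": callback_url}, timeout=30)

time.sleep(5)                         # give the server a moment to make the request
server.shutdown()

# The finding is only real if the outbound request was actually observed.
print(f"outbound request observed: {bool(hits)}")
```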
This verification isn’t based on language or inference. It’s based on observable effects. In practice, effective systems treat AI-generated signals as tentative by default. The model’s role is to explore aggressively and surface hypotheses, not to declare findings. Only issues that survive concrete testing are elevated.
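One way to picture that division of labor is a thin triage layer that refuses to promote anything a deterministic validator can’t confirm. The sketch below uses assumed names (Hypothesis, triage, and the validator registry are illustrative, not any real framework’s API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    kind: str        # e.g. "sqli-blind", "ssrf"
    target: str      # endpoint or parameter under suspicion
    rationale: str   # the model's reasoning, kept for context

# A validator performs one concrete, deterministic check (timing delta,
# out-of-band callback, data access) and returns True only on an observed effect.
Validator = Callable[[Hypothesis], bool]

def triage(hypotheses: list[Hypothesis], validators: dict[str, Validator]) -> list[Hypothesis]:
    """Elevate only hypotheses that survive concrete testing; drop the rest."""
    findings = []
    for h in hypotheses:
        check = validators.get(h.kind)
        if check is not None and check(h):
            findings.append(h)       # proven: becomes a reportable finding
        # unproven suspicions never reach the report
    return findings

# Usage with a placeholder validator; a real one would run checks like the
# timing and callback sketches above.
if __name__ == "__main__":
    demo = [Hypothesis("sqli-blind", "/login", "response pattern resembled prior exploits")]
    print(triage(demo, {"sqli-blind": lambda h: False}))  # nothing proven, nothing reported
```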
This distinction matters because creative reasoning and factual verification are different problems, and they require different tools.
The takeaway for security teams
LLMs are powerful amplifiers for exploration. They help systems think more like attackers and cover more ground than manual testing ever could. But they can’t be the final authority on whether a vulnerability exists.
Security teams don’t need AI that sounds certain. They need systems that are skeptical by default, assume the model might be wrong, and demand proof before surfacing results.
The future of AI in security isn’t about eliminating hallucinations at the model level. It’s about designing systems that expect them and are built to catch them before they reach a human inbox.
