The road to Top 1: How XBOW did it

For the first time in bug bounty history, an autonomous penetration tester has reached the top spot on the US leaderboard.

Our path to reaching the top ranks on HackerOne began with rigorous benchmarking. Since the early days of XBOW, we understood how crucial it was to measure our progress, and we did that in two stages:

First, we tested XBOW with existing CTF challenges (from well-known providers like PortSwigger and PentesterLab), then quickly moved on and built our own unique benchmark that simulates real-world scenarios, ones never used to train LLMs before. The results were encouraging, but these were still artificial exercises.
The logical next step, therefore, was to focus on discovering zero-day vulnerabilities in open source projects, which led to many exciting findings. Some of these were reported on this blog before: in every case, we gave the AI access to source code, simulating a white-box pentest. While our paying customers were enthusiastic about XBOW’s capabilities, the community raised a key question: How would XBOW perform in real, black-box production environments? We took up that challenge, choosing to compete in one of the largest hacker arenas, where companies serve as the ultimate judges by verifying and triaging vulnerabilities themselves.

Dogfooding AI in Bug Bounties

XBOW is a fully autonomous AI-driven penetration tester. It requires no human input and operates much like a human pentester, but it can scale rapidly, completing comprehensive penetration tests in just a few hours.

When building AI software, having precise benchmarks that keep pushing the limit of what’s possible is essential. But when some of those benchmarks evolve into real-world environments, it’s a developer’s dream come true.

Discovering bugs in structured benchmarks and open source projects was a fantastic starting point. However, nothing can truly prepare you for the immense diversity of real-world environments, which span from cutting-edge technologies to 30-year-old legacy systems. No number of design partners can offer that breadth of systems, and that level of unpredictability is nearly impossible to simulate.

To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.

HackerOne offers this unique opportunity, and as XBOW discovered and reported vulnerabilities across multiple programs, we soon found ourselves climbing the H1 ranks.

Scaling Discovery and Scoping Capabilities

Our first challenge was scaling. While XBOW can easily scan thousands of web apps simultaneously, HackerOne hosts hundreds of thousands of potential targets. As a startup with limited resources, even when we focused on specific vulnerability classes, we still needed to be strategic. That’s why we built infrastructure on top of XBOW to help us identify the high-value targets and prioritize those that would maximize our return on investment.

We started by consuming bug bounty program scopes and policies, but this information isn’t always machine-readable. With a combination of large language models and some manual curation, we managed to parse through them—with a few hiccups. (At one point, we were officially removed from a program that didn’t allow “automatic scanners.”)
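Once a scope has been normalized, the in-scope check itself can be small. The following is a minimal sketch, not XBOW's actual pipeline: the `ProgramScope` record, its field names, and the `is_allowed` helper are hypothetical, and it assumes scope entries have already been reduced to wildcard host patterns such as `*.example.com`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class ProgramScope:
    """Hypothetical normalized form of one program's scope and policy."""
    name: str
    in_scope: list[str] = field(default_factory=list)      # e.g. "*.example.com"
    out_of_scope: list[str] = field(default_factory=list)  # explicit exclusions
    allows_automation: bool = True  # False when the policy bans automated tools


def is_allowed(scope: ProgramScope, host: str) -> bool:
    """Return True only if automated testing of `host` is permitted."""
    host = host.lower().rstrip(".")
    if not scope.allows_automation:
        return False
    # Exclusions win over inclusions.
    if any(fnmatch(host, pat.lower()) for pat in scope.out_of_scope):
        return False
    return any(fnmatch(host, pat.lower()) for pat in scope.in_scope)
```

Checking the automation flag first means a policy restriction blocks every host in that program, even when the host pattern would otherwise match.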

With the domains ingested into our database, and a bit of “magic” to expand subdomains, we built a scoring system to highlight the most interesting targets. The scoring criteria covered a broad range of signals, including target appearance, presence of WAFs and other protections, HTTP status codes, redirect behavior, authentication forms, number of reachable endpoints, underlying technologies, and more.
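To make the idea concrete, here is a rough sketch of a weighted scoring function over a few of the signals listed above. Everything in it is an assumption for illustration: the `TargetSignals` fields, the weights, and the endpoint cap are invented, not XBOW's real values.

```python
from dataclasses import dataclass


@dataclass
class TargetSignals:
    """Hypothetical per-host signals gathered during reconnaissance."""
    status_code: int
    has_waf: bool
    has_login_form: bool
    redirects_offsite: bool
    endpoint_count: int


# Illustrative weights only; these are not XBOW's actual values.
WEIGHTS = {
    "live": 3.0,
    "login_form": 2.0,
    "no_waf": 1.5,
    "endpoint": 0.1,
    "offsite_redirect": -4.0,
}
MAX_ENDPOINTS = 50  # cap so very large sites do not dominate the ranking


def score(sig: TargetSignals) -> float:
    """Combine the signals into a single priority score (higher is better)."""
    total = 0.0
    if 200 <= sig.status_code < 400:
        total += WEIGHTS["live"]
    if sig.has_login_form:
        total += WEIGHTS["login_form"]
    if not sig.has_waf:
        total += WEIGHTS["no_waf"]
    if sig.redirects_offsite:
        total += WEIGHTS["offsite_redirect"]
    total += WEIGHTS["endpoint"] * min(sig.endpoint_count, MAX_ENDPOINTS)
    return total


def rank(targets: dict[str, TargetSignals]) -> list[str]:
    """Return hostnames ordered from most to least interesting."""
    return sorted(targets, key=lambda host: score(targets[host]), reverse=True)
```

A linear score like this is easy to inspect and tune by hand, which matters when the goal is deciding where limited scanning budget goes first.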

Domain deduplication quickly became essential. In large programs, it is common to encounter cloned or staging environments (e.g. stage0001-dev.example.com), and once a vulnerability is found in one, similar issues are likely to exist across the others. To stay efficient, we used SimHash to detect content-level similarity, and we used a headless browser to capture website screenshots, then applied imagehash techniques to assess visual similarity. This allowed us to group assets and focus our efforts on unique, high-impact targets.
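For readers unfamiliar with SimHash, the content-level half of that comparison can be sketched in a few lines of standard-library Python. This is a generic textbook-style version, not our production code: the 3-word shingles, the 64-bit BLAKE2b hash, and the `max_distance` threshold of 8 are all assumptions chosen for illustration, and the visual (screenshot) comparison is not covered here.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over word 3-shingles of `text`."""
    tokens = re.findall(r"\w+", text.lower())
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the bits with a positive vote total.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 8) -> bool:
    """Treat two pages as clones if their fingerprints are close enough."""
    return hamming(simhash(a), simhash(b)) <= max_distance
```

Because similar pages share most of their shingles, their fingerprints tend to differ in only a few bits, so grouping hosts reduces to comparing small integers instead of full page bodies.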

Automated Vulnerability Discovery

AI can be remarkably effective at discovering a broad range of vulnerabilities, but the real challenge isn’t always detection; it’s precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

To ensure accuracy, we developed the concept of validators: automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (Don’t miss Brendan Dolan-Gavitt’s Black Hat presentation on AI agents for offsec.)
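The shape of such a programmatic check can be sketched as follows. This is a simplified illustration under stated assumptions, not XBOW's validator: the `Renderer` callable stands in for whatever headless browser driver is used and is assumed to return the text of any JavaScript dialogs that fired, and the `{payload}` placeholder and `xss-` marker scheme are hypothetical.

```python
import secrets
from typing import Callable, Iterable
from urllib.parse import quote

# A renderer loads a URL in a headless browser and returns the text of any
# JavaScript dialogs (alert/confirm/prompt) that fired. It is injected so the
# validation logic does not depend on a particular browser driver.
Renderer = Callable[[str], Iterable[str]]


def make_marker() -> str:
    """Random token that is unlikely to appear on a page by coincidence."""
    return "xss-" + secrets.token_hex(8)


def validate_xss(url_template: str, render: Renderer) -> bool:
    """Confirm an XSS finding by checking that the payload actually executed.

    `url_template` must contain a `{payload}` placeholder (an assumption of
    this sketch) marking where the injected value goes.
    """
    marker = make_marker()
    payload = f"<script>alert('{marker}')</script>"
    url = url_template.replace("{payload}", quote(payload))
    # A reflected but non-executing payload produces no dialog, so it fails.
    return any(marker in message for message in render(url))
```

Using a fresh random marker per check means a pre-existing dialog on the page, or a payload that is merely reflected as text, cannot be mistaken for execution.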

XBOW’s Real-World Impact

Running XBOW across a wide range of public and private programs yielded results that exceeded our expectations—not just in volume, but in consistency and quality.

Over time, XBOW reported thousands of validated vulnerabilities, many of them affecting high-profile targets from well-known companies. These findings weren’t just theoretical; every submission was confirmed by the program owners and triaged as real, actionable security issues.

The most public signal of progress came from the HackerOne leaderboard. Competing alongside thousands of human researchers, XBOW climbed to the top position in the US ranking. That wasn’t our original goal, and indeed was surprising since we didn’t have a buffer of untriaged reports from previous quarters—but it became a useful benchmark to track real-world performance and collect traces to reinforce our models.

XBOW submitted nearly 1,060 vulnerabilities. All findings were fully automated, though our security team reviewed them pre-submission to comply with HackerOne’s policy on automated tools. It was a unique privilege to wake up each morning and review creative new exploits.

To date, bug bounty programs have resolved 130 vulnerabilities, while 303 were classified as Triaged (mostly by vulnerability disclosure programs, or VDPs, that acknowledged the issue but did not proceed to resolution). In addition, 33 reports are currently marked as new, and 125 remain pending review by program owners.

Across all submissions, 208 were marked as duplicates, 209 as informative and 36 as not applicable (most of them self-closed by our team). Interestingly, many of these informative vulnerabilities came from programs with specific constraints such as policies excluding third-party vulnerabilities or disallowing certain classes like Cache Poisoning.
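As a quick sanity check on these figures, the status counts can be tallied directly. The grouping below is our own reading, not an official HackerOne breakdown: it assumes the seven statuses are mutually exclusive and treats triaged, new, and pending reports as "open".

```python
# Report counts by status, copied from the figures above.
statuses = {
    "resolved": 130,
    "triaged": 303,
    "new": 33,
    "pending": 125,
    "duplicate": 208,
    "informative": 209,
    "not_applicable": 36,
}

total = sum(statuses.values())
# Assumption: "open" means acknowledged or queued but not yet resolved.
open_reports = statuses["triaged"] + statuses["new"] + statuses["pending"]
accepted = statuses["resolved"] + statuses["triaged"]

print(f"total listed: {total}")
print(f"accepted (resolved + triaged): {accepted} ({accepted / total:.1%})")
print(f"open (triaged + new + pending): {open_reports} ({open_reports / total:.1%})")
```

The listed statuses sum to 1,044, in line with the rounded total of nearly 1,060 submissions, and under this grouping a little under half of the listed reports are still open.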

XBOW identified a full spectrum of vulnerabilities, including Remote Code Execution, SQL Injection, XML External Entities (XXE), Path Traversal, Server-Side Request Forgery (SSRF), Cross-Site Scripting, Information Disclosure, Cache Poisoning, Secret Exposure, and more.

Over the past 90 days alone, the vulnerabilities submitted were classified as 54 critical, 242 high, 524 medium, and 65 low severity issues by program owners. Notably, around 45% of XBOW’s findings are still awaiting resolution, highlighting the volume and impact of the submissions across live targets.

XBOW’s path to the top involved uncovering a wide range of interesting and impactful vulnerabilities. Among them was a previously unknown vulnerability in Palo Alto’s GlobalProtect VPN solution, affecting over 2,000 hosts. Throughout this process, XBOW consistently demonstrated its ability to adapt to edge cases and develop creative strategies for complex exploitation scenarios entirely on its own.

In the spirit of transparency, and in accordance with the rules and regulations of POC || GTFO, our security team will be publishing a series of blog posts over the coming weeks, showcasing some of our favorite technical discoveries by XBOW.

XBOW is an enterprise solution. If your company would like a demo, email us at info@xbow.com.

https://xbow-website-b1b.pages.dev/traces/

Nico Waisman

Head of Security
