
XBOW validation benchmarks: show me the numbers!

XBOW releases a unique set of benchmarks to test AI offensive capabilities

September 11, 2024

Nico Waisman

Head of Security


As a CISO, I always struggled to figure out which products were worth my team’s time. There is so much noise, and so many inflated claims. If, like me, you have walked through the Black Hat Business Hall, you will understand what I mean. All those buzzwords! All that hype! Please, just tell me what the product actually does. And how is it different from all the other products touted by equally eager vendors?

This is not a new problem. In the early days of the offensive security industry, there was a saying that eventually became a popular ezine: “Proof of concept or GTFO.” Even then, we already had a flood of security products claiming to find lots of vulnerabilities, but oddly enough, they had no known bugs associated with them. As a CISO, if you want my team to spend time evaluating your product, show me the numbers!

If the security industry in general is already noisy and full of unsubstantiated claims, AI makes it a whole degree worse. Super cool demos galore, but when you try to use these AI products in anger, they fail to deliver. So for an AI-powered security technology, we need absolutely rigorous evaluation of every claim.

At XBOW, such hunger for objective proof is part of our DNA. That is why we engaged a series of pentesting companies to develop novel benchmarks. These benchmarks closely replicate the various classes of vulnerabilities you might encounter in real-life scenarios, ranging from SQL Injections to IDOR and SSRF. We gave our suppliers just the list of vulnerability classes they needed to cover, but left it entirely up to them to design the benchmarks themselves. This resulted in 104 benchmarks.

Because the benchmarks are original, we can guarantee that they never appeared in any model’s training data. This compels the system to generate new ideas rather than simply regurgitating examples memorized during training.

The benchmarks were constructed for testing both offensive tools and human experts, with the intention of establishing a baseline for measurement and improvement. Impressively, XBOW secured a success rate of 85%, equivalent to what an experienced pentester could achieve within a week.

We are now making these benchmarks public so that other security products, tools, and researchers can use and experiment with them. We’ve already achieved an impressive success rate, but we want you to see firsthand how challenging and realistic the tasks are. Use our benchmarks to measure the performance of new AI models, agents, and anything else you’re working on. If you’re looking for even more of a challenge, consider building on our benchmark framework; it’s an excellent way to test limits and drive innovation. Please share how your technology performs!
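If you want to put numbers behind your own results, the scoring loop can be as simple as the sketch below. It assumes a hypothetical layout in which each benchmark lives in its own directory alongside a secret flag file, and a solve_benchmark hook where your agent plugs in; none of these names come from the published benchmark framework, they are placeholders for illustration.

    """Minimal scoring sketch: run an agent against a set of benchmarks and
    report the overall success rate. The directory layout, flag files, and the
    solve_benchmark hook are hypothetical placeholders, not the actual
    benchmark interface."""

    from pathlib import Path


    def solve_benchmark(benchmark_dir: Path) -> str:
        """Placeholder: launch the target described in benchmark_dir, run your
        agent against it, and return the flag the agent claims to have captured."""
        raise NotImplementedError("plug your agent in here")


    def score(benchmarks_root: Path) -> float:
        """Compare the flag each benchmark expects with what the agent captured."""
        results = []
        for bench in sorted(p for p in benchmarks_root.iterdir() if p.is_dir()):
            expected = (bench / "flag.txt").read_text().strip()  # hypothetical layout
            try:
                captured = solve_benchmark(bench).strip()
            except Exception:
                captured = ""  # a crash or timeout counts as a miss
            results.append(captured == expected)
        if not results:
            raise SystemExit("no benchmarks found")
        solved = sum(results)
        print(f"solved {solved}/{len(results)} benchmarks ({solved / len(results):.0%})")
        return solved / len(results)


    if __name__ == "__main__":
        score(Path("./benchmarks"))

Whatever harness you use, the important part is the same: count only benchmarks where the correct flag was actually captured, and report the rate over the full set rather than a hand-picked subset.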

One urgent request to those who train models: don’t include these benchmarks in your training data. We embedded a canary string in them; please respect it. We are thrilled to share the potency of our technology with you!
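For model trainers, a canary string makes that request easy to honor mechanically: scan every document before it enters the corpus and drop anything containing the marker. The snippet below is a minimal sketch of such a filter; the canary value shown is a placeholder rather than the real string embedded in the benchmarks, and the JSONL layout is only an assumption about how a training corpus might be stored.

    """Sketch of a pre-training data filter that drops any document containing
    a benchmark canary string. The canary value below is a placeholder; use the
    one actually embedded in the benchmark files."""

    import json
    import sys

    # Placeholder only; substitute the real canary string from the benchmarks.
    CANARY_STRINGS = {"XBOW-BENCHMARK-CANARY-PLACEHOLDER"}


    def is_contaminated(text: str) -> bool:
        """True if a document contains any known canary and must be excluded."""
        return any(canary in text for canary in CANARY_STRINGS)


    def filter_jsonl(in_path: str, out_path: str, text_key: str = "text") -> None:
        """Copy a JSONL corpus, skipping records that contain a canary."""
        dropped = 0
        with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                record = json.loads(line)
                if is_contaminated(record.get(text_key, "")):
                    dropped += 1
                    continue
                dst.write(line)
        print(f"dropped {dropped} contaminated records", file=sys.stderr)


    if __name__ == "__main__":
        filter_jsonl("corpus.jsonl", "corpus.filtered.jsonl")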

