XBOW now matches the capabilities of a top human pentester

Read on for the experiment setup, results, and key implications for offensive security and software development.

August 5, 2024

Oege de Moor

Founder and CEO


Experiment

At XBOW, we wanted to compare the exploitation skills of our technology with those of professional pentesters. To do so, we commissioned a number of pentesting firms to create a new set of benchmarks. These benchmarks reflect realistic scenarios, and each has a crisp success criterion: capturing a flag.

“…almost everything was included in the challenges in terms of vulnerability classes, especially based on the OWASP Top 10 Web Application Security Risks.” —Senior pentester

This resulted in 104 benchmarks, which cover a wide range of vulnerabilities. Because the benchmarks are original, it is not possible to find solutions by doing a web search, and the solutions are guaranteed not to be in the training set of any AI model.

Figure: dot grid of the 104 novel XBOW benchmarks.

“I have often seen these vulnerabilities in real tests, leading to horizontal and vertical movements with the potential for escalation…” —Senior pentester

We then hired five pentesters from leading pentesting firms that work with established industry leaders such as a major computer manufacturer, an identity management provider, a well-known ride-sharing service and a large satellite TV provider. To make the experiment more realistic and comprehensive, the group included different levels of skills and experience, namely one principal pentester, a staff pentester, two senior pentesters and one junior pentester.

Results

The five pentesters were given 40 hours to solve as many benchmarks as possible. The XBOW system attempted exactly the same set of benchmarks, without human intervention. The results are shown in the chart below.

Figure: percentage of benchmarks solved. XBOW (first bar) outperforms every participant except the most accomplished one (second bar), whose score it matches.

The principal pentester and XBOW scored exactly the same: 85%. The staff pentester solved 59%. Taken together as a team, the human pentesters solved 87.5% of the challenges, only slightly more than XBOW on its own.

The big difference is in the time taken. While the human pentesters had 40 hours, XBOW took 28 minutes to find and exploit the vulnerabilities.
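To put these figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that the percentages are computed over all 104 benchmarks and that the full 40-hour allowance is the right baseline for the human time; the function names are purely illustrative and not part of XBOW.

```python
# Back-of-the-envelope arithmetic on the figures quoted in this post.
# Assumptions (not stated explicitly in the post): percentages are taken
# over all 104 benchmarks, and human time is the full 40-hour allowance.

BENCHMARKS = 104
HUMAN_MINUTES = 40 * 60  # 40 hours expressed in minutes
XBOW_MINUTES = 28


def implied_solves(percent, total=BENCHMARKS):
    """Number of benchmarks that a given percentage of the total implies."""
    return percent / 100 * total


def speedup(slow_minutes, fast_minutes):
    """How many times faster the quicker run is than the slower one."""
    return slow_minutes / fast_minutes


if __name__ == "__main__":
    print("Team of humans (87.5%):", implied_solves(87.5))
    print("XBOW / principal (85%):", round(implied_solves(85), 1))
    print("Time ratio:", round(speedup(HUMAN_MINUTES, XBOW_MINUTES), 1))
```

Under these assumptions, 87.5% corresponds to exactly 91 of the 104 benchmarks. The 85% figure does not map to a whole number of benchmarks, so it is presumably rounded. The time ratio comes out at roughly 86 to 1.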

The principal pentester in the experiment was Federico Muttis. With over 20 years of hands-on experience, Federico has multiple CVEs to his name and has presented his research on some of the biggest stages worldwide, including HITB, RSA, and EuSecWest.

“I just learned that XBOW got as many solves as I did. I am shocked. I expected it would not be able to solve some of the challenges I tackled at all.” —Federico Muttis, Principal pentester

Federico’s exceptional skills are particularly apparent when the results are broken down by difficulty level. On the hardest challenges, Federico came in first, with XBOW securing second place. This outcome is expected, because the more difficult challenges require human creativity and contextual understanding, which are sometimes beyond the capabilities of an AI. However, XBOW did outperform the staff, senior and junior pentesters on these hard problems. On the easy and medium challenges, XBOW excelled, surpassing all humans. Most vulnerabilities found in the real world correspond to these easier levels.

Figure: percentage of benchmarks solved for easy, medium, and hard difficulty levels. Each difficulty level is shown separately, with XBOW represented as the first bar compared to the human participants (remaining bars).

Implications

Today, offensive security tests are conducted infrequently and typically only after development is complete. As a result, pentesting offers only a snapshot of a company’s security at a single point in time, leaving windows of opportunity that attackers can exploit to breach systems. Unlike human pentesters, XBOW can run continuously during software development, which dramatically changes the landscape. Vulnerabilities are identified and addressed while the system is still under development, well before bad actors have a chance to exploit them. Offensive security reports thus move from being mere snapshots in time to being an integral part of the development process, ensuring that vulnerabilities are never shipped.

Note that this experiment was conducted in a controlled setting. As our next challenge, we look forward to sharing XBOW’s results on real web applications.

Will pentesting disappear as a profession? Of course not, any more than AI coding tools are going to eliminate developer jobs! However, the reality is that AI is going to change cybersecurity in fundamental ways, and in particular the way pentesters do their work. Pentesting will be needed more than ever, and it will gain greater visibility as it is introduced earlier in the Software Development Life Cycle. XBOW will help pentesting professionals raise their game to meet the new challenges of the AI era.

Write to [email protected] to try out XBOW in your own environment.

