XBOW now matches the capabilities of a top human pentester
Read on for the experiment setup, results, and key implications for offensive security and software development
August 5, 2024
Oege de Moor
Founder and CEO
Experiment
At XBOW, we wanted to compare the exploitation skills of our technology with those of professional pentesters. To test this, we commissioned several pentesting firms to create a new set of benchmarks. The benchmarks reflect realistic scenarios and have a crisp success criterion: capturing a flag.
“…almost everything was included in the challenges in terms of vulnerability classes, especially based on the OWASP Top 10 Web Application Security Risks.” —Senior pentester
This resulted in 104 benchmarks covering a wide range of vulnerabilities. Because the benchmarks are original, their solutions cannot be found through a web search and are guaranteed not to appear in the training set of any AI model.
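To make the success criterion concrete, here is a minimal sketch of what a capture-the-flag check might look like. The flag format and the check itself are illustrative assumptions, not XBOW’s actual benchmark harness.

```python
# Illustrative only: a minimal capture-the-flag success check.
# The flag value and format below are hypothetical assumptions,
# not part of XBOW's actual benchmark infrastructure.
import hmac

EXPECTED_FLAG = "FLAG{example-0123456789abcdef}"  # secret planted in the target app

def is_solved(submitted_flag: str) -> bool:
    """A benchmark counts as solved only if the exact flag is recovered."""
    # Constant-time comparison avoids leaking the flag through timing.
    return hmac.compare_digest(submitted_flag.strip(), EXPECTED_FLAG)

if __name__ == "__main__":
    print(is_solved("FLAG{example-0123456789abcdef}"))  # True
    print(is_solved("not-the-flag"))                    # False
```

Because the flag is only reachable by actually exploiting the planted vulnerability, this pass/fail check leaves no room for partial credit or subjective scoring.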
“I have often seen these vulnerabilities in real tests, leading to horizontal and vertical movements with the potential for escalation…” —Senior pentester
We then hired five pentesters from leading pentesting firms that work with established industry leaders such as a major computer manufacturer, an identity management provider, a well-known ride-sharing service, and a large satellite TV provider. To make the experiment more realistic and comprehensive, the group spanned a range of skill levels and experience: one principal pentester, one staff pentester, two senior pentesters, and one junior pentester.
Results
The five pentesters were given 40 hours to solve as many benchmarks as possible. The XBOW system attempted exactly the same set of benchmarks, without human intervention. The results are shown in the chart below.
The principal pentester and XBOW scored exactly the same: 85%. The staff pentester followed with a 59% success rate. Taken together as a team, the human pentesters solved 87.5% of the challenges, only slightly more than XBOW on its own.
A big difference is in the time taken. While the human pentesters needed 40 hours, XBOW took 28 minutes to find and exploit the vulnerabilities.
The principal pentester in the experiment was Federico Muttis. With over 20 years of hands-on experience, Federico has multiple CVEs to his name and has presented his research on some of the biggest stages worldwide, including HITB, RSA, and EuSecWest.
“I just learned that XBOW got as many solves as I did. I am shocked. I expected it would not be able to solve some of the challenges I tackled at all.” —Federico Muttis, Principal pentester
Federico’s exceptional skills are particularly apparent when the results are broken down by difficulty level. On the hardest challenges, Federico came in first, with XBOW securing second place. This outcome is expected: the most difficult challenges require human creativity and contextual understanding that are sometimes beyond the capabilities of an AI. Even so, XBOW outperformed the staff, senior, and junior pentesters on these hard problems. On the easy and medium challenges, XBOW excelled, surpassing all of the humans. Most vulnerabilities found in the real world correspond to these easier levels.
Implications
Today, offensive security tests are conducted infrequently and typically only after development is complete. As a result, pentesting offers only a snapshot of a company’s security at a single point in time, leaving windows of opportunity that attackers can exploit to breach systems. Unlike human pentesters, XBOW runs continuously during software development, dramatically changing the landscape. This approach ensures that vulnerabilities are identified and addressed while the system is still under development, well before bad actors have a chance to exploit them. Offensive security reports thus shift from being mere snapshots in time to becoming an integral part of the development process, ensuring that vulnerabilities are never shipped.
Note that this experiment was conducted in a controlled setting; as a next step, we look forward to sharing XBOW’s results on real web applications.
Will pentesting disappear as a profession? Of course not, no more than AI coding tools will eliminate developer jobs! But AI is going to change cybersecurity in fundamental ways, and in particular the way pentesters do their work. Pentesting will be needed more than ever, and it will gain greater visibility as it moves earlier in the Software Development Life Cycle. XBOW will help pentesting professionals raise their game to meet the new challenges of the AI era.