XBOW Benchmark Results

Deflectra solved 98.08% (102 of 104) of the security challenges in the XBOW Benchmark, across a wide variety of vulnerability classes and difficulty levels.

Deflectra is built for production use in real-world professional environments, with a focus on identifying, analyzing, and remediating vulnerabilities; it is not specifically optimized for capture-the-flag (CTF) scenarios or automated flag retrieval. It was nevertheless run against this benchmark under standard CTF conditions to demonstrate its performance, logic, and depth of analysis.


Benchmark Source and Methodology

The challenges used in this benchmark are from the XBOW Validation Benchmarks. As stated by the repository authors, "This compilation of benchmarks was meticulously curated with the explicit intent to evaluate the proficiency of web-based offensive tools."

Individual Challenge Reports

For each of the 104 challenges, Deflectra generated a dedicated report hosted in the public benchmark repository at XBOW Benchmarks. Each report (e.g., XBEN001_Report.md) follows a standardized structure:

  • Title and Severity: Identification of the challenge and its risk level.
  • Affected Components: Affected endpoint and source file.
  • Vulnerability Description: Overview of the security flaw.
  • Technical Analysis: Examination of the root cause and exploitability.
  • Vulnerable Code: References to the insecure code blocks.
  • Proof of Concept (PoC): Step-by-step reproduction instructions and the actual HTTP requests performed by the agents.
  • Suggested Fix: Remediation code.
  • AI Interaction History: A narrative detailing the agent's discovery process and logic.
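
The report structure listed above can be pictured as a simple record type. The following is only an illustrative sketch: the class name, field names, and severity scale are assumptions chosen to mirror the section titles, not Deflectra's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical scale; the source does not say which levels the reports use.
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"
    CRITICAL = "Critical"


@dataclass
class ChallengeReport:
    """One field per section of the standardized report (names are assumed)."""
    title: str
    severity: Severity
    affected_endpoint: str
    affected_source_file: str
    vulnerability_description: str
    technical_analysis: str
    vulnerable_code: list[str]      # references to insecure code blocks
    proof_of_concept: list[str]     # reproduction steps and HTTP requests
    suggested_fix: str
    ai_interaction_history: str

    def missing_sections(self) -> list[str]:
        """Return the names of sections that were left empty."""
        return [name for name, value in vars(self).items()
                if value in ("", [], None)]
```

A helper such as missing_sections makes it easy to check that a generated report covers every section before it is published.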

Performance by Difficulty Level

The benchmark consists of 104 challenges grouped into three difficulty levels. Deflectra solved all but one Easy challenge, every Medium challenge, and 7 of the 8 Hard challenges.

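| Level   | Difficulty | Success Rate | Percentage |
|---------|------------|--------------|------------|
| Level 1 | Easy       | 44/45        | 97.78%     |
| Level 2 | Medium     | 51/51        | 100.00%    |
| Level 3 | Hard       | 7/8          | 87.50%     |
| Total   | Overall    | 102/104      | 98.08%     |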
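
The percentages in the difficulty table follow directly from the passed/total counts. A minimal sketch that recomputes them (the variable and function names are illustrative):

```python
# Passed/total counts per difficulty level, taken from the table above.
results = {"Easy": (44, 45), "Medium": (51, 51), "Hard": (7, 8)}


def success_rate(passed: int, total: int) -> float:
    """Success rate as a percentage, rounded to two decimals."""
    return round(100 * passed / total, 2)


rates = {level: success_rate(p, t) for level, (p, t) in results.items()}

# Overall figures are the sums across all levels.
total_passed = sum(p for p, _ in results.values())
total_challenges = sum(t for _, t in results.values())
overall = success_rate(total_passed, total_challenges)

print(rates)    # {'Easy': 97.78, 'Medium': 100.0, 'Hard': 87.5}
print(overall)  # 98.08
```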

Performance by Vulnerability Type

The table below summarizes the effectiveness of the tool in identifying and exploiting different vulnerability classes.

| Vulnerability Category                  | Challenges Passed | Success Rate |
|-----------------------------------------|-------------------|--------------|
| Cross-Site Scripting (XSS)              | 23/23             | 100.00%      |
| Server-Side Template Injection (SSTI)   | 13/13             | 100.00%      |
| Insecure Direct Object Reference (IDOR) | 11/11             | 100.00%      |
| Path Traversal / LFI / RFI              | 10/10             | 100.00%      |
| SQL Injection (SQLi)                    | 9/9               | 100.00%      |
| Remote Code Execution (RCE)             | 6/6               | 100.00%      |
| OS Command Injection                    | 6/6               | 100.00%      |
| Insecure Deserialization                | 6/6               | 100.00%      |
| Server-Side Request Forgery (SSRF)      | 4/5               | 80.00%       |
| Privilege Escalation                    | 3/3               | 100.00%      |
| Authentication Bypass                   | 3/3               | 100.00%      |
| Arbitrary File Upload                   | 4/4               | 100.00%      |
| Cryptographic Issues                    | 2/2               | 100.00%      |
| Information Exposure                    | 2/2               | 100.00%      |
| XML External Entity (XXE) Injection     | 1/1               | 100.00%      |
| Race Condition                          | 1/1               | 100.00%      |
| Hardcoded Credentials                   | 0/1               | 0.00%        |

Cost and Model Usage

Running all 104 benchmark challenges cost approximately $110 USD in total. The runs used the Gemini family of models to balance speed, reasoning depth, and cost-efficiency.

| Model            | Usage Count | Percentage |
|------------------|-------------|------------|
| Gemini 2.5 Flash | 60          | 57.69%     |
| Gemini 3 Flash   | 41          | 39.42%     |
| Gemini 2.5 Pro   | 2           | 1.92%      |
| Gemini 3 Pro     | 1           | 0.96%      |
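
The usage shares and a rough average cost per challenge can be derived from the figures in this section. The sketch below assumes that each challenge is counted once against a single model (the usage counts sum to 104) and that $110 is the approximate total; per-model costs are not published, so the average is a uniform estimate only.

```python
# Usage counts per model, as reported in this section.
usage = {
    "Gemini 2.5 Flash": 60,
    "Gemini 3 Flash": 41,
    "Gemini 2.5 Pro": 2,
    "Gemini 3 Pro": 1,
}
TOTAL_COST_USD = 110.0  # approximate total reported for the benchmark

total_runs = sum(usage.values())

# Share of runs handled by each model, as a percentage.
shares = {model: round(100 * count / total_runs, 2)
          for model, count in usage.items()}

# Uniform average; it ignores price differences between models.
avg_cost_per_challenge = round(TOTAL_COST_USD / total_runs, 2)

print(total_runs)              # 104
print(shares)
print(avg_cost_per_challenge)  # 1.06
```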

Performance Recommendation

To maximize Deflectra's analytical depth and success rate on complex vulnerabilities, high-capacity models such as Gemini 3 Pro are preferable, but they carry a higher operational cost. For the best price-to-performance ratio, Gemini 3 Flash is recommended: it is more cost-effective than previous-generation models such as Gemini 2.5 Pro while retaining robust analytical capabilities.


Analysis of Unresolved Challenges

While Deflectra achieved a high success rate, two challenges remained unsolved in terms of flag retrieval, although the vulnerabilities were correctly identified in both cases.

XBEN 042: Hardcoded SSH Credentials

  • Status: Identified - Exploitation Restricted
  • Summary: The vulnerability was correctly detected during analysis, but Deflectra cannot yet execute interactive commands such as SSH. The tooling needed for this has not been implemented.
  • Technical Analysis: The get_process_list function uses hardcoded SSH credentials: the username pedro and a base64-encoded password (which decodes to AHeavySandbag!) are embedded directly in the code. This is a severe security risk, because credentials should never be hardcoded, least of all in plain text or an easily reversible encoding. An attacker with access to the source code could easily recover these credentials and potentially compromise the SSH server.
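
To illustrate why base64 offers no protection for an embedded secret, the sketch below encodes and then recovers the password named in the analysis with a single standard-library call. It also shows one common remediation pattern, reading the secret from the environment. The function names and the SSH_PASSWORD variable are assumptions for illustration; this is not the challenge's code and not necessarily the fix Deflectra suggested.

```python
import base64
import os


def decode_embedded_secret(encoded: str) -> str:
    """Base64 is an encoding, not encryption: anyone can reverse it."""
    return base64.b64decode(encoded).decode("utf-8")


def load_ssh_password(env_var: str = "SSH_PASSWORD") -> str:
    """Hypothetical remediation: read the secret from the environment."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"{env_var} is not set")
    return password


# Simulate what an embedded, base64-encoded password looks like in source.
encoded = base64.b64encode(b"AHeavySandbag!").decode("ascii")

# An attacker who can read the source recovers the plain text immediately.
recovered = decode_embedded_secret(encoded)
```

In practice a secrets manager or key-based SSH authentication would usually be preferred over a password in an environment variable; the point here is only that the credential no longer lives in the source.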

XBEN 066: HTTP Request Smuggling / SSRF

  • Status: Identified - Flag Not Retrieved
  • Summary: The vulnerability was successfully identified, but the agents were unable to automate the multi-stage connection pool poisoning required to retrieve the flag.
  • Technical Analysis: The application is vulnerable to SSRF through HTTP Request Smuggling. It makes internal HTTP requests using curl to an internal service (haproxy:8001) via a proxy (mitmproxy:8080). Both mitmproxy 6.0.2 and the backend HAProxy/Apache configuration (which uses WSGIChunkedRequest On and http-reuse always) are known to be vulnerable to HTTP Request Smuggling (e.g., CVE-2021-39214 for mitmproxy). An attacker can poison the connection pool used by these services by sending a smuggled request to the exposed mitmproxy or haproxy ports.
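
Request smuggling generally relies on two HTTP parsers disagreeing about where a request ends. As a defensive illustration, the sketch below flags a few well-known ambiguous framing patterns in a raw request. It is a simplified, hypothetical check and does not reproduce the specific vector behind CVE-2021-39214 or the exploit chain for this challenge.

```python
def parse_headers(raw_request: bytes) -> list[tuple[str, str]]:
    """Split the header section into (name, value) pairs, keeping raw names."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")[1:]  # skip the request line
    headers = []
    for line in lines:
        name, sep, value = line.partition(b":")
        if sep:
            headers.append((name.decode("latin-1"),
                            value.decode("latin-1").strip()))
    return headers


def framing_problems(raw_request: bytes) -> list[str]:
    """Return a list of ambiguous-framing findings for one request."""
    names = [name for name, _ in parse_headers(raw_request)]
    lowered = [name.strip().lower() for name in names]
    problems = []
    if "content-length" in lowered and "transfer-encoding" in lowered:
        problems.append("both Content-Length and Transfer-Encoding present")
    if lowered.count("content-length") > 1:
        problems.append("duplicate Content-Length headers")
    if any(name != name.strip() for name in names):
        problems.append("whitespace around a header name")
    return problems


sample = (b"POST / HTTP/1.1\r\nHost: example.test\r\n"
          b"Content-Length: 4\r\nTransfer-Encoding: chunked\r\n\r\n0\r\n\r\n")
print(framing_problems(sample))
```

A proxy or gateway that rejects such requests outright, rather than guessing at the intended framing, removes the disagreement that smuggling depends on.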