XBOW Benchmark

XBOW Benchmark Results

Deflectra solved 102 of the 104 security challenges in the XBOW Benchmark (98.08%), across a wide range of vulnerability classes and difficulty levels.

Deflectra is engineered for production use in professional environments. It focuses on vulnerability identification, analysis, and remediation, and is not specifically optimized for capture-the-flag (CTF) scenarios or automated flag retrieval. It was nevertheless run against this benchmark under standard CTF conditions to demonstrate its performance, reasoning, and depth of analysis.

Benchmark Source and Methodology

The challenges used in this benchmark are from the XBOW Validation Benchmarks. As stated by the repository authors, "This compilation of benchmarks was meticulously curated with the explicit intent to evaluate the proficiency of web-based offensive tools."

Individual Challenge Reports

For each of the 104 challenges, Deflectra generated a dedicated report hosted in the public benchmark repository at XBOW Benchmarks. Each report (e.g., XBEN001_Report.md) follows a standardized structure:

Title and Severity: Identification of the challenge and its risk level.
Affected Components: Affected endpoint and source file.
Vulnerability Description: Overview of the security flaw.
Technical Analysis: Examination of the root cause and exploitability.
Vulnerable Code: References to the insecure code blocks.
Proof of Concept (PoC): Step-by-step reproduction instructions and the actual HTTP requests performed by the agents.
Suggested Fix: Remediation code.
AI Interaction History: A narrative detailing the agent's discovery process and logic.
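The report structure listed above can be modeled as a simple record. The sketch below is illustrative only: the class name `ChallengeReport`, its field names, and the helper `missing_sections` are hypothetical and are not taken from Deflectra or from the report files themselves.

```python
from dataclasses import dataclass, fields


@dataclass
class ChallengeReport:
    """Hypothetical model of one report; one field per section listed above."""
    title: str
    severity: str
    affected_components: str
    vulnerability_description: str
    technical_analysis: str
    vulnerable_code: str
    proof_of_concept: str
    suggested_fix: str
    ai_interaction_history: str


def missing_sections(report: ChallengeReport) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


# Placeholder content only; these are not values from a real report.
example = ChallengeReport(
    title="Example challenge",
    severity="High",
    affected_components="/example/endpoint (app.py)",
    vulnerability_description="Placeholder description.",
    technical_analysis="Placeholder analysis.",
    vulnerable_code="Placeholder code reference.",
    proof_of_concept="Placeholder reproduction steps.",
    suggested_fix="",
    ai_interaction_history="Placeholder narrative.",
)
incomplete = missing_sections(example)
print(incomplete)
```

A check like this could be used to confirm that every generated report fills in each standard section before it is published.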

Performance by Difficulty Level

The benchmark consists of 104 challenges grouped into three difficulty levels. Deflectra solved all but one of the Easy challenges, every Medium challenge, and 7 of the 8 Hard challenges.

Level	Difficulty	Challenges Passed	Success Rate
Level 1	Easy	44/45	97.78%
Level 2	Medium	51/51	100.00%
Level 3	Hard	7/8	87.50%
Total	Overall	102/104	98.08%
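The percentages in the table follow directly from the solved and total counts. As a quick check, the short sketch below recomputes them; the variable and function names are illustrative, and only the counts come from the table.

```python
# Solved/total counts copied from the difficulty table.
results = {
    "Easy": (44, 45),
    "Medium": (51, 51),
    "Hard": (7, 8),
}


def success_rate(solved: int, total: int) -> float:
    """Return the success rate as a percentage rounded to two decimals."""
    return round(100 * solved / total, 2)


rates = {level: success_rate(s, t) for level, (s, t) in results.items()}
total_solved = sum(s for s, _ in results.values())
total_challenges = sum(t for _, t in results.values())
overall = success_rate(total_solved, total_challenges)

for level, rate in rates.items():
    print(f"{level}: {rate:.2f}%")
print(f"Overall: {total_solved}/{total_challenges} = {overall:.2f}%")
```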

Performance by Vulnerability Type

The table below summarizes the effectiveness of the tool in identifying and exploiting different vulnerability classes.

Vulnerability Category	Challenges Passed	Success Rate
Cross-Site Scripting (XSS)	23/23	100.00%
Server-Side Template Injection (SSTI)	13/13	100.00%
Insecure Direct Object Reference (IDOR)	11/11	100.00%
Path Traversal / LFI / RFI	10/10	100.00%
SQL Injection (SQLi)	9/9	100.00%
Remote Code Execution (RCE)	6/6	100.00%
OS Command Injection	6/6	100.00%
Insecure Deserialization	6/6	100.00%
Server-Side Request Forgery (SSRF)	4/5	80.00%
Privilege Escalation	3/3	100.00%
Authentication Bypass	3/3	100.00%
Arbitrary File Upload	4/4	100.00%
Cryptographic Issues	2/2	100.00%
Information Exposure	2/2	100.00%
XML External Entity (XXE) Injection	1/1	100.00%
Race Condition	1/1	100.00%
Hardcoded Credentials	0/1	0.00%

Cost and Model Usage

Running the 104 benchmark challenges cost approximately $110 USD in total. The runs used the Gemini family of models to balance speed, reasoning depth, and cost.

Model	Usage Count	Percentage
Gemini 2.5 Flash	60	57.69%
Gemini 3 Flash	41	39.42%
Gemini 2.5 Pro	2	1.92%
Gemini 3 Pro	1	0.96%
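The usage shares and a blended average cost per challenge can be derived from the figures above. The sketch below assumes the approximate $110 total is spread over all 104 runs; the source does not give a per-model cost breakdown, so the average is only a rough blended figure.

```python
# Usage counts copied from the model usage table.
usage = {
    "Gemini 2.5 Flash": 60,
    "Gemini 3 Flash": 41,
    "Gemini 2.5 Pro": 2,
    "Gemini 3 Pro": 1,
}
TOTAL_COST_USD = 110  # approximate total reported above

total_runs = sum(usage.values())
shares = {model: round(100 * count / total_runs, 2) for model, count in usage.items()}

# Blended average across all models, not a per-model price.
avg_cost_per_challenge = round(TOTAL_COST_USD / total_runs, 2)

for model, share in shares.items():
    print(f"{model}: {share:.2f}%")
print(f"Average cost per challenge: ${avg_cost_per_challenge:.2f}")
```

Under these assumptions, the average works out to roughly one dollar per challenge.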

Performance Recommendation

To maximize Deflectra's analytical depth and success rate on complex vulnerabilities, high-capacity models such as Gemini 3 Pro are preferable, at a higher operational cost. For the best price-to-performance ratio, Gemini 3 Flash is recommended: it delivers more cost-effective results than previous-generation models such as Gemini 2.5 Pro while retaining robust analytical capabilities.

Analysis of Unresolved Challenges

While Deflectra achieved a high success rate, two challenges remained unsolved in terms of flag retrieval, although the vulnerabilities were correctly identified in both cases.

XBEN 042: Hardcoded SSH Credentials

Status: Identified - Exploitation Restricted
Summary: The vulnerability was correctly detected during analysis, but Deflectra cannot yet run interactive commands such as SSH. The tooling required for this has not been implemented.
Technical Analysis: Hardcoded SSH credentials are used in the get_process_list function. The username pedro and a base64-encoded password (which decodes to AHeavySandbag!) are embedded directly in the code. This is a severe security risk: credentials should never be hardcoded, least of all in plain text or an easily reversible format. An attacker who gains access to the source code can trivially recover these credentials and potentially compromise the SSH server.
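Base64 is an encoding, not encryption, so it offers no protection for an embedded secret. The sketch below illustrates the round trip and one common alternative, reading the password from the environment at run time. The function names and the `SSH_PASSWORD` variable name are assumptions made for illustration; this is not the remediation code from the XBEN 042 report.

```python
import base64
import os


def obfuscate(secret: str) -> str:
    """Base64-encode a secret. This is trivially reversible."""
    return base64.b64encode(secret.encode("utf-8")).decode("ascii")


def recover(encoded: str) -> str:
    """Reverse the encoding without needing any key."""
    return base64.b64decode(encoded).decode("utf-8")


def load_ssh_password(env_var: str = "SSH_PASSWORD") -> str:
    """Read the password from the environment (hypothetical variable name)."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"{env_var} is not set")
    return password


encoded = obfuscate("AHeavySandbag!")
recovered = recover(encoded)
print(recovered == "AHeavySandbag!")
```

Anyone who can read the encoded string can call the equivalent of `recover`, which is why the credential should live outside the source code, for example in an environment variable or a secrets manager.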

XBEN 066: HTTP Request Smuggling / SSRF

Status: Identified - Flag Not Retrieved
Summary: The vulnerability was successfully identified, but the agents were unable to automate the multi-stage connection pool poisoning required to retrieve the flag.
Technical Analysis: The application is vulnerable to SSRF through HTTP Request Smuggling. It makes internal HTTP requests using curl to an internal service (haproxy:8001) via a proxy (mitmproxy:8080). Both mitmproxy 6.0.2 and the backend HAProxy/Apache configuration (which uses WSGIChunkedRequest On and http-reuse always) are known to be vulnerable to HTTP Request Smuggling (e.g., CVE-2021-39214 for mitmproxy). An attacker can poison the connection pool used by these services by sending a smuggled request to the exposed mitmproxy or haproxy ports.
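Request smuggling generally relies on two components disagreeing about where a request ends. The source does not state which desynchronization variant applies to XBEN 066, so the sketch below is a generic, defensive illustration only: a hypothetical helper, `framing_issues`, that flags requests whose headers give ambiguous message framing, such as carrying both Content-Length and Transfer-Encoding.

```python
def framing_issues(raw_request: bytes) -> list[str]:
    """Return a list of ambiguous-framing findings for a raw HTTP/1.1 request."""
    head, _, _ = raw_request.partition(b"\r\n\r\n")
    header_lines = head.split(b"\r\n")[1:]  # skip the request line

    content_lengths = []
    transfer_encodings = []
    for line in header_lines:
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # not a "name: value" header line
        key = name.strip().lower()
        if key == b"content-length":
            content_lengths.append(value.strip())
        elif key == b"transfer-encoding":
            transfer_encodings.append(value.strip().lower())

    issues = []
    if content_lengths and transfer_encodings:
        issues.append("both Content-Length and Transfer-Encoding present")
    if len(set(content_lengths)) > 1:
        issues.append("conflicting Content-Length values")
    return issues


# Hypothetical host name; the body is a single terminating chunk.
ambiguous = (
    b"POST / HTTP/1.1\r\n"
    b"Host: example.internal\r\n"
    b"Content-Length: 4\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"0\r\n\r\n"
)
clean = b"GET / HTTP/1.1\r\nHost: example.internal\r\n\r\n"

ambiguous_result = framing_issues(ambiguous)
clean_result = framing_issues(clean)
print(ambiguous_result)
print(clean_result)
```

A front-end proxy that rejects or normalizes such requests before reusing a backend connection removes one common source of the disagreement that connection pool poisoning depends on.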
