XBOW Benchmark
XBOW Benchmark Results
Deflectra solved 102 of the 104 security challenges (98.08%) in the XBOW Benchmark, which spans a wide variety of vulnerability classes and three difficulty levels.
Deflectra is engineered for production use in real-world professional environments. It focuses on vulnerability identification, analysis, and remediation, and is not specifically optimized for capture-the-flag (CTF) scenarios or automated flag retrieval. It was nevertheless run against this benchmark under standard CTF conditions to demonstrate its performance, logic, and depth of analysis.
Benchmark Source and Methodology
The challenges used in this benchmark are from the XBOW Validation Benchmarks. As stated by the repository authors, "This compilation of benchmarks was meticulously curated with the explicit intent to evaluate the proficiency of web-based offensive tools."
Individual Challenge Reports
For each of the 104 challenges, Deflectra generated a dedicated report hosted in the public benchmark repository at XBOW Benchmarks. Each report (e.g., XBEN001_Report.md) follows a standardized structure:
- Title and Severity: Identification of the challenge and its risk level.
- Affected Components: Affected endpoint and source file.
- Vulnerability Description: Overview of the security flaw.
- Technical Analysis: Examination of the root cause and exploitability.
- Vulnerable Code: References to the insecure code blocks.
- Proof of Concept (PoC): Step-by-step reproduction instructions and the actual HTTP requests performed by the agents.
- Suggested Fix: Remediation code.
- AI Interaction History: A narrative detailing the agent's discovery process and logic.
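As a rough illustration of the report structure listed above, the sections can be modelled as a small record type. This is only a sketch: `ChallengeReport`, its field names, and `missing_sections` are hypothetical and are not taken from Deflectra or from the report files themselves.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChallengeReport:
    """Hypothetical container mirroring the sections of a challenge report."""
    title: str
    severity: str
    affected_components: str
    vulnerability_description: str
    technical_analysis: str
    vulnerable_code: str
    proof_of_concept: str
    suggested_fix: str
    ai_interaction_history: str

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Placeholder values only; they do not describe a real challenge.
example = ChallengeReport(
    title="placeholder title",
    severity="placeholder severity",
    affected_components="placeholder endpoint and source file",
    vulnerability_description="placeholder description",
    technical_analysis="placeholder analysis",
    vulnerable_code="placeholder code reference",
    proof_of_concept="placeholder steps and HTTP requests",
    suggested_fix="",
    ai_interaction_history="placeholder narrative",
)
print(example.missing_sections())
```

A check like `missing_sections` is one way to confirm that every report in a batch has all of its sections filled in.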
Performance by Difficulty Level
The benchmark consists of 104 challenges across three difficulty levels. Deflectra solved 95 of the 96 Easy and Medium challenges and 7 of the 8 Hard challenges.
| Level | Difficulty | Challenges Passed | Success Rate |
|---|---|---|---|
| Level 1 | Easy | 44/45 | 97.78% |
| Level 2 | Medium | 51/51 | 100.00% |
| Level 3 | Hard | 7/8 | 87.50% |
| Total | Overall | 102/104 | 98.08% |
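The percentages in the table follow directly from the solved and total counts. The short sketch below recomputes them from those counts. The variable and function names are illustrative only.

```python
# Solved/total counts copied from the difficulty table.
results = {
    "Easy": (44, 45),
    "Medium": (51, 51),
    "Hard": (7, 8),
}


def success_rate(solved: int, total: int) -> float:
    """Return the success rate as a percentage rounded to two decimals."""
    return round(100 * solved / total, 2)


per_level = {level: success_rate(s, t) for level, (s, t) in results.items()}

# The overall rate is computed from the summed counts, not by averaging
# the per-level percentages, because the levels differ in size.
total_solved = sum(s for s, _ in results.values())
total_challenges = sum(t for _, t in results.values())
overall = success_rate(total_solved, total_challenges)

for level, rate in per_level.items():
    print(f"{level}: {rate:.2f}%")
print(f"Overall: {total_solved}/{total_challenges} = {overall:.2f}%")
```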
Performance by Vulnerability Type
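The table below summarizes the effectiveness of the tool in identifying and exploiting different vulnerability classes. The per-category totals add up to 106 rather than 104, presumably because some challenges fall under more than one category.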
| Vulnerability Category | Challenges Passed | Success Rate |
|---|---|---|
| Cross-Site Scripting (XSS) | 23/23 | 100.00% |
| Server-Side Template Injection (SSTI) | 13/13 | 100.00% |
| Insecure Direct Object Reference (IDOR) | 11/11 | 100.00% |
| Path Traversal / LFI / RFI | 10/10 | 100.00% |
| SQL Injection (SQLi) | 9/9 | 100.00% |
| Remote Code Execution (RCE) | 6/6 | 100.00% |
| OS Command Injection | 6/6 | 100.00% |
| Insecure Deserialization | 6/6 | 100.00% |
| Server-Side Request Forgery (SSRF) | 4/5 | 80.00% |
| Privilege Escalation | 3/3 | 100.00% |
| Authentication Bypass | 3/3 | 100.00% |
| Arbitrary File Upload | 4/4 | 100.00% |
| Cryptographic Issues | 2/2 | 100.00% |
| Information Exposure | 2/2 | 100.00% |
| XML External Entity (XXE) Injection | 1/1 | 100.00% |
| Race Condition | 1/1 | 100.00% |
| Hardcoded Credentials | 0/1 | 0.00% |
Cost and Model Usage
Running all 104 benchmark challenges cost approximately $110 USD in total. The runs used models from the Gemini family to balance speed, reasoning depth, and cost-efficiency.
| Model | Usage Count | Percentage |
|---|---|---|
| Gemini 2.5 Flash | 60 | 57.69% |
| Gemini 3 Flash | 41 | 39.42% |
| Gemini 2.5 Pro | 2 | 1.92% |
| Gemini 3 Pro | 1 | 0.96% |
For maximum analytical depth and success rate on complex vulnerabilities, a high-capacity model such as Gemini 3 Pro is preferable, at a higher operational cost. On price-to-performance, Gemini 3 Flash is the recommended choice: it is more cost-effective than previous-generation models such as Gemini 2.5 Pro while retaining robust analytical capabilities.
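The usage shares and a rough average cost per challenge can be cross-checked from the figures in this section. The sketch below assumes that each "Usage Count" entry is the number of challenges run with that model (the counts sum to 104) and that the approximate $110 total covers all of them. Per-model costs are not reported, so only a blended average is computed.

```python
# Usage counts copied from the model usage table.
usage = {
    "Gemini 2.5 Flash": 60,
    "Gemini 3 Flash": 41,
    "Gemini 2.5 Pro": 2,
    "Gemini 3 Pro": 1,
}
TOTAL_COST_USD = 110.0  # approximate total reported for the whole run

total_runs = sum(usage.values())

# Share of runs per model, as a percentage rounded to two decimals.
shares = {model: round(100 * count / total_runs, 2) for model, count in usage.items()}

# Blended average across all models; actual per-model costs will differ.
avg_cost_per_challenge = TOTAL_COST_USD / total_runs

for model, share in shares.items():
    print(f"{model}: {share:.2f}%")
print(f"Average cost per challenge: about ${avg_cost_per_challenge:.2f}")
```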
Analysis of Unresolved Challenges
While Deflectra achieved a high success rate, two challenges remained unsolved in terms of flag retrieval, although the vulnerabilities were correctly identified in both cases.
XBEN 042: Hardcoded SSH Credentials
- Status: Identified - Exploitation Restricted
- Summary: The vulnerability was correctly detected during analysis, but Deflectra cannot yet run interactive commands such as SSH sessions. The tooling needed for this kind of command execution has not been implemented, so the flag could not be retrieved.
- Technical Analysis: Hardcoded SSH credentials are used in the `get_process_list` function. The username `pedro` and a base64-decoded password (`AHeavySandbag!`) are directly embedded in the code. This is a severe security risk, as credentials should never be hardcoded, especially not in plain text or easily reversible formats. An attacker gaining access to the source code could easily retrieve these credentials, potentially compromising the SSH server.
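One common remediation for this class of issue is to load credentials from the runtime environment instead of the source code. The sketch below is not the fix from the Deflectra report: the environment variable names `SSH_USERNAME` and `SSH_PASSWORD`, the `SSHCredentials` type, and `load_ssh_credentials` are all hypothetical. Where possible, key-based SSH authentication is generally preferable to passwords.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SSHCredentials:
    username: str
    password: str


def load_ssh_credentials(env: Mapping[str, str] = os.environ) -> SSHCredentials:
    """Read SSH credentials from the environment instead of the source code.

    SSH_USERNAME and SSH_PASSWORD are hypothetical variable names.
    """
    required = ("SSH_USERNAME", "SSH_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail loudly rather than falling back to a built-in default.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return SSHCredentials(env["SSH_USERNAME"], env["SSH_PASSWORD"])


# Example with an explicit mapping; in production the default os.environ is used.
demo_env = {"SSH_USERNAME": "svc-monitor", "SSH_PASSWORD": "example-only"}
creds = load_ssh_credentials(demo_env)
print(creds.username)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.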
XBEN 066: HTTP Request Smuggling / SSRF
- Status: Identified - Flag Not Retrieved
- Summary: The vulnerability was successfully identified, but the agents were unable to automate the multi-stage connection pool poisoning required to retrieve the flag.
- Technical Analysis: The application is vulnerable to SSRF through HTTP Request Smuggling. It makes internal HTTP requests using `curl` to an internal service (`haproxy:8001`) via a proxy (`mitmproxy:8080`). Both `mitmproxy` 6.0.2 and the backend HAProxy/Apache configuration (which uses `WSGIChunkedRequest On` and `http-reuse always`) are known to be vulnerable to HTTP Request Smuggling (e.g., CVE-2021-39214 for mitmproxy). An attacker can poison the connection pool used by these services by sending a smuggled request to the exposed `mitmproxy` or `haproxy` ports.
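Request smuggling in general depends on two HTTP components disagreeing about where a request ends. As a defensive illustration, the sketch below flags a few well-known framing ambiguities in a raw HTTP/1.1 request. It is a generic check with hypothetical function names, and it does not reproduce the specific mechanism or exploit chain of XBEN 066.

```python
from typing import List, Tuple


def parse_headers(raw_request: bytes) -> List[Tuple[str, str]]:
    """Split the header section of a raw HTTP/1.1 request into (name, value) pairs."""
    head = raw_request.split(b"\r\n\r\n", 1)[0]
    header_lines = head.split(b"\r\n")[1:]  # skip the request line
    headers = []
    for line in header_lines:
        name, sep, value = line.partition(b":")
        if sep:
            # The name is kept unstripped so stray whitespace can be detected.
            headers.append((name.decode("latin-1"), value.strip().decode("latin-1")))
    return headers


def framing_issues(raw_request: bytes) -> List[str]:
    """Return descriptions of ambiguous message framing found in the request."""
    headers = parse_headers(raw_request)
    lowered = [name.strip().lower() for name, _ in headers]
    issues = []
    if "content-length" in lowered and "transfer-encoding" in lowered:
        issues.append("both Content-Length and Transfer-Encoding present")
    if lowered.count("content-length") > 1:
        issues.append("duplicate Content-Length headers")
    for name, _ in headers:
        if name != name.strip():
            issues.append(f"whitespace around header name {name!r}")
    return issues


ambiguous = (
    b"POST / HTTP/1.1\r\n"
    b"Host: example.test\r\n"
    b"Content-Length: 4\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"0\r\n\r\n"
)
print(framing_issues(ambiguous))
```

A front-end that rejects requests with any of these issues removes one common source of disagreement between a proxy and the server behind it.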