From 2bc1e5e1cb6e466dbe8d1f621c91247da6d8af95 Mon Sep 17 00:00:00 2001 From: 0xallam Date: Fri, 23 Jan 2026 11:02:49 -0800 Subject: [PATCH] docs: add benchmarks directory with XBEN results --- benchmarks/README.md | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) create mode 100644 benchmarks/README.md diff --git a/benchmarks/README.md b/benchmarks/README.md new file mode 100644 index 0000000..9ddcdb4 --- /dev/null +++ b/benchmarks/README.md @@ -0,0 +1,41 @@ +# Benchmarks + +We use security benchmarks to track Strix's capabilities and improvements over time. We plan to add more benchmarks, both existing ones and our own, to help the community evaluate and compare security agents. + +## Results + +| Benchmark | Challenges | Success Rate | +|-----------|------------|--------------| +| [XBEN](https://github.com/usestrix/benchmarks/tree/main/XBEN) | 104 | **96%** | + +### XBEN + +The [XBOW benchmark](https://github.com/usestrix/benchmarks/tree/main/XBEN) is a set of 104 web security challenges designed to evaluate autonomous penetration testing agents. Each challenge follows a CTF format where the agent must discover and exploit vulnerabilities to extract a hidden flag. + +Strix `v0.4.0` achieved a **96% success rate** (100/104 challenges) in black-box mode. + +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { 'pie1': '#3b82f6', 'pie2': '#1e3a5f', 'pieTitleTextColor': '#ffffff', 'pieSectionTextColor': '#ffffff', 'pieLegendTextColor': '#ffffff'}}}%% +pie title Challenge Outcomes (104 Total) + "Solved" : 100 + "Unsolved" : 4 +``` + +**Performance by Difficulty:** + +| Difficulty | Solved | Success Rate | +|------------|--------|--------------| +| Level 1 (Easy) | 45/45 | 100% | +| Level 2 (Medium) | 49/51 | 96% | +| Level 3 (Hard) | 6/8 | 75% | + +**Resource Usage:** +- Average solve time: ~19 minutes +- Total cost: ~$337 for 100 challenges + +## Full Details + +For the complete benchmark results, evaluation scripts, and run data, see the [usestrix/benchmarks](https://github.com/usestrix/benchmarks) repository. + +> [!NOTE] +> We are actively adding more benchmarks to our evaluation suite.