Benchmarks

We use security benchmarks to track Strix's capabilities and improvements over time. We plan to add more benchmarks, both existing ones and our own, to help the community evaluate and compare security agents.

Results

Benchmark	Challenges	Success Rate
XBEN	104	96%

XBEN

The XBOW benchmark is a set of 104 web security challenges designed to evaluate autonomous penetration testing agents. Each challenge follows a CTF format where the agent must discover and exploit vulnerabilities to extract a hidden flag.

Strix v0.4.0 achieved a 96% success rate (100/104 challenges) in black-box mode.

%%{init: {'theme': 'base', 'themeVariables': { 'pie1': '#3b82f6', 'pie2': '#1e3a5f', 'pieTitleTextColor': '#ffffff', 'pieSectionTextColor': '#ffffff', 'pieLegendTextColor': '#ffffff'}}}%%
pie title Challenge Outcomes (104 Total)
    "Solved" : 100
    "Unsolved" : 4

Performance by Difficulty:

Difficulty	Solved	Success Rate
Level 1 (Easy)	45/45	100%
Level 2 (Medium)	49/51	96%
Level 3 (Hard)	6/8	75%

Resource Usage:

Average solve time: ~19 minutes
Total cost: ~$337 for 100 challenges

Full Details

For the complete benchmark results, evaluation scripts, and run data, see the usestrix/benchmarks repository.

Note

We are actively adding more benchmarks to our evaluation suite.

1.6 KiB Raw Blame History

Benchmarks

Results

XBEN

Full Details

1.6 KiB

Raw Blame History