Benchmarks
We're working toward an official, verified place on the Terminal-Bench leaderboard. We're not there yet. What's below is our first controlled, self-run comparison — shared in full, including the raw result files and exactly how to reproduce it. Until we complete the official run, we're also benchmarking against other open models to keep building the picture honestly.
Terminal-Bench 2.0 — Atlarix vs opencode
All 89 tasks · MiniMax-M3 (fp8) · single attempt (k=1) · same model, provider & infrastructure on both sides.
| Harness | Resolved | Score |
|---|---|---|
| Atlarix | 42 / 89 | 47% |
| opencode | 39 / 89 | 44% |
In our single run, Atlarix resolved 3 more tasks than opencode. We are deliberately not claiming a win from that: a 3-task gap on 89 tasks at a single attempt is too small to separate from run-to-run variation. The honest takeaway is that an open model expresses its ability roughly as well under Atlarix as under a leading harness, so the harness isn't holding the model back. Absolute scores sit below this model's ~66% tuned-scaffold ceiling, as expected for a general harness at fp8 with a single attempt.
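For readers who want to check the arithmetic, here is a minimal sketch that recomputes the two scores from the table and puts a rough uncertainty band around each. It uses a normal-approximation binomial interval, which is our simplifying assumption and not part of the benchmark's methodology. It also treats the two runs as independent, although both harnesses ran the same 89 tasks. The function name is illustrative.

```python
import math

TOTAL_TASKS = 89
RESOLVED = {"Atlarix": 42, "opencode": 39}  # counts from the table above


def score_with_interval(resolved: int, total: int, z: float = 1.96):
    """Return (rate, half_width) using a normal-approximation binomial interval.

    This is a rough approximation chosen for illustration, not the
    benchmark's own statistics.
    """
    rate = resolved / total
    std_err = math.sqrt(rate * (1 - rate) / total)
    return rate, z * std_err


for harness, resolved in RESOLVED.items():
    rate, half_width = score_with_interval(resolved, TOTAL_TASKS)
    print(f"{harness}: {rate:.1%} +/- {half_width:.1%}")

# The gap between the two harnesses, as a fraction of all tasks.
gap = (RESOLVED["Atlarix"] - RESOLVED["opencode"]) / TOTAL_TASKS
print(f"gap: {gap:.1%}")
```

Under these assumptions each score carries a band of roughly ten percentage points either way, while the gap is between three and four points. That is why a single-attempt result like this one is best read as "about the same".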
Raw results (verify it yourself)
These are the unedited output files from the run — the Harbor job results with per-task pass/fail for both harnesses. Nothing here is hand-typed; download and check them.
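One way to check the per-task results is to line the two harnesses up task by task. The sketch below assumes each result file has first been reduced to a simple JSON object mapping task name to pass/fail. That layout is a hypothetical simplification, not Harbor's actual output schema, and the task names in the example are placeholders, not real tasks or results.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict[str, bool]:
    """Load a {task_name: passed} mapping from JSON (assumed, simplified layout)."""
    raw = json.loads(Path(path).read_text())
    return {name: bool(passed) for name, passed in raw.items()}


def compare(first: dict[str, bool], second: dict[str, bool]) -> dict[str, list[str]]:
    """Bucket the tasks present in both runs by which harness resolved them."""
    shared = sorted(first.keys() & second.keys())
    return {
        "both": [t for t in shared if first[t] and second[t]],
        "only_first": [t for t in shared if first[t] and not second[t]],
        "only_second": [t for t in shared if not first[t] and second[t]],
        "neither": [t for t in shared if not first[t] and not second[t]],
    }


# Placeholder data for illustration only; these are not real tasks or results.
first = {"task-a": True, "task-b": True, "task-c": False}
second = {"task-a": True, "task-b": False, "task-c": False}

summary = compare(first, second)
for bucket, tasks in summary.items():
    print(f"{bucket}: {len(tasks)} {tasks}")
```

The `only_first` and `only_second` buckets are the interesting ones: they show which tasks account for the difference between two runs, which a headline count cannot.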
Reproduce it yourself
The exact bundle we ran is public — the same Electron-free Atlarix headless build, downloadable as a release tarball. The benchmark runs on the open-source Harbor framework. Setup we used:
- Dataset: terminal-bench/terminal-bench-2 (all 89 tasks)
- Model: minimax/minimax-m3, routed through OpenRouter and pinned to one provider at fp8, identical for both harnesses
- Infrastructure: Harbor on Modal (-e modal), one isolated container per task
- Settings (both harnesses, identical): single attempt (-k 1), native timeout (--timeout-multiplier 1), native function-calling forced (no text-tool shim)
- Atlarix bundle (public): atlarix-headless-linux-amd64.tar.gz
# Atlarix harness
harbor run -d terminal-bench/terminal-bench-2 \
-m openai/minimax/minimax-m3 \
-n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
-e modal --agent-import-path atlarix_tb:AtlarixAgent
# opencode harness (same model + provider + infra)
harbor run -d terminal-bench/terminal-bench-2 \
-m bench/minimax/minimax-m3 \
-n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
-e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent
Full disclosure: the one change we made
In the Atlarix desktop app, the agent asks your approval before every file write and command — that's a core safety feature. A benchmark runs unattended, with no human to approve anything. So to participate at all, we grant that approval once, up front, via an explicit operator flag (ATLARIX_AUTONOMOUS_DANGER=1). Without it, every task that needs an install, a cleanup, or a privileged command would simply be blocked and fail.
This is the only deviation from the shipping app's default behavior, and it isn't an advantage over the other harness: every agent auto-approves to run an automated benchmark (it's inherent to running unattended). We're stating it plainly so the setup is fully transparent. The flag is off by default; the interactive app always asks.
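To make the behavior concrete, here is a minimal sketch of an approval gate controlled by an environment flag. It is not Atlarix's actual implementation. Only the flag name ATLARIX_AUTONOMOUS_DANGER comes from the text above; the function names, the prompt wording, and the injected `env` and `ask` parameters are hypothetical and exist to keep the example testable.

```python
import os

AUTONOMY_FLAG = "ATLARIX_AUTONOMOUS_DANGER"  # flag name taken from the text above


def is_autonomous(env=os.environ) -> bool:
    """True only when the operator has explicitly set the flag to "1"."""
    return env.get(AUTONOMY_FLAG) == "1"


def approve(action: str, env=os.environ, ask=input) -> bool:
    """Hypothetical gate: auto-approve if the flag is set, otherwise ask a human."""
    if is_autonomous(env):
        return True
    answer = ask(f"Allow {action!r}? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


# Flag unset: the operator is asked, and here declines.
denied = approve("rm -rf build/", env={}, ask=lambda prompt: "n")
# Flag set: approved up front, the operator is never prompted.
granted = approve("rm -rf build/", env={AUTONOMY_FLAG: "1"}, ask=lambda prompt: "n")
print(denied, granted)
```

The point of the sketch is the default: with the flag unset, nothing runs without an explicit yes, which matches the interactive behavior described above.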
Where we are, and what's next
- Official leaderboard: a full 5-attempt (k=5) run for a verified Terminal-Bench submission — the goal we're working toward.
- More open models: the same head-to-head on DeepSeek, Kimi, and Qwen, so any claim holds across the open-weight frontier, not one model.
- More benchmarks: SWE-bench and beyond, so the picture never rests on a single test.
Full write-up: benchmark blog post.