Benchmarks

    We're working toward an official, verified place on the Terminal-Bench leaderboard. We're not there yet. What's below is our first controlled, self-run comparison — shared in full, including the raw result files and exactly how to reproduce it. Until we complete the official run, we're also benchmarking against other open models to keep building the picture honestly.

    Read this first. This is a preliminary, single-attempt (k=1) run we did ourselves. The official Terminal-Bench leaderboard requires 5 attempts (k=5), which we have not yet done. So this does not prove Atlarix is "ahead" of anyone — a 3-task difference at single-attempt is within run-to-run noise. The most we read into it: on this model, Atlarix is competitive with a leading harness. That's it.

    Terminal-Bench 2.0 — Atlarix vs opencode

    All 89 tasks · MiniMax-M3 (fp8) · single attempt (k=1) · same model, provider & infrastructure on both sides.

    Harness     Resolved    Score
    Atlarix     42 / 89     47%
    opencode    39 / 89     44%

    In our single run Atlarix resolved 3 more tasks than opencode. We are deliberately not claiming a win from that. The honest takeaway is that an open model expresses its ability roughly as well under Atlarix as under a leading harness — the harness isn't holding the model back. Absolute scores sit below this model's ~66% tuned-scaffold ceiling, as expected for a general harness at fp8 / single-attempt.
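
A rough way to put numbers on the "within noise" caveat is sketched below in Python (chosen for illustration; it is not part of the Harbor run). It computes a 95% Wilson interval for each score and an unpaired two-proportion z statistic. Assumptions: each task is treated as an independent pass/fail trial and the pairing of tasks across harnesses is ignored, so this captures sampling uncertainty over the 89 tasks rather than true run-to-run variance, which only repeated attempts can measure.

```python
import math

N_TASKS = 89
ATLARIX_RESOLVED = 42
OPENCODE_RESOLVED = 39


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def two_proportion_z(s1, s2, n):
    """Unpaired z statistic for two proportions measured on n trials each."""
    pooled = (s1 + s2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    return (s1 / n - s2 / n) / se


atlarix_ci = wilson_interval(ATLARIX_RESOLVED, N_TASKS)
opencode_ci = wilson_interval(OPENCODE_RESOLVED, N_TASKS)
z = two_proportion_z(ATLARIX_RESOLVED, OPENCODE_RESOLVED, N_TASKS)

print(f"Atlarix  95% interval: {atlarix_ci[0]:.3f} to {atlarix_ci[1]:.3f}")
print(f"opencode 95% interval: {opencode_ci[0]:.3f} to {opencode_ci[1]:.3f}")
print(f"z statistic for the difference: {z:.2f}")
```

Under these assumptions z comes out around 0.45, well below the 1.96 usually required for significance at the 5% level, and the two intervals overlap almost entirely. That is consistent with reading the 3-task gap as noise.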

    Raw results (verify it yourself)

    These are the unedited output files from the run — the Harbor job results with per-task pass/fail for both harnesses. Nothing here is hand-typed; download and check them.
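
Once per-task pass/fail has been extracted from the result files, a paired comparison says more than the totals do. The sketch below is a minimal illustration in Python, assuming the results have already been loaded into dictionaries mapping task name to a boolean. That shape, the function names, and the placeholder task names are hypothetical (the Harbor file layout is not described here), and the toy values are not real results.

```python
from math import comb


def discordant_tasks(a, b):
    """Return tasks resolved only by harness a and only by harness b."""
    shared = a.keys() & b.keys()
    only_a = {t for t in shared if a[t] and not b[t]}
    only_b = {t for t in shared if b[t] and not a[t]}
    return only_a, only_b


def mcnemar_exact_p(n_only_a, n_only_b):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = n_only_a + n_only_b
    if n == 0:
        return 1.0
    k = min(n_only_a, n_only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder data for illustration only; replace with the real per-task results.
harness_a = {"task-a": True, "task-b": True, "task-c": False, "task-d": True, "task-e": False}
harness_b = {"task-a": True, "task-b": False, "task-c": True, "task-d": False, "task-e": False}

only_a, only_b = discordant_tasks(harness_a, harness_b)
p_value = mcnemar_exact_p(len(only_a), len(only_b))
print(sorted(only_a), sorted(only_b), p_value)
```

The test looks only at tasks where the two harnesses disagree, which is the part of the data that can actually separate them.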

    Reproduce it yourself

    The exact bundle we ran is public — the same Electron-free Atlarix headless build, downloadable as a release tarball. The benchmark runs on the open-source Harbor framework. Setup we used:

    • Dataset: terminal-bench/terminal-bench-2 (all 89 tasks)
    • Model: minimax/minimax-m3, routed through OpenRouter and pinned to one provider at fp8 — identical for both harnesses
    • Infrastructure: Harbor on Modal (-e modal), one isolated container per task
    • Settings (both harnesses, identical): single attempt (-k 1), native timeout (--timeout-multiplier 1), native function-calling forced (no text-tool shim)
    • Atlarix bundle (public): atlarix-headless-linux-amd64.tar.gz

    # Atlarix harness
    harbor run -d terminal-bench/terminal-bench-2 \
      -m openai/minimax/minimax-m3 \
      -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
      -e modal --agent-import-path atlarix_tb:AtlarixAgent
    
    # opencode harness (same model + provider + infra)
    harbor run -d terminal-bench/terminal-bench-2 \
      -m bench/minimax/minimax-m3 \
      -n 24 -k 1 -y --timeout-multiplier 1 --max-retries 3 \
      -e modal --agent-import-path atlarix_tb.opencode_proxy:BenchOpenCodeAgent

    Full disclosure: the one change we made

    In the Atlarix desktop app, the agent asks for your approval before every file write and command; that's a core safety feature. A benchmark runs unattended, with no human to approve anything. So to participate at all, we grant that approval once, up front, via an explicit operator flag (ATLARIX_AUTONOMOUS_DANGER=1). Without it, every task that needs an install, a cleanup, or a privileged command would simply be blocked and fail.

    This is the only deviation from the shipping app's default behavior, and it isn't an advantage over the other harness: every agent has to auto-approve in order to run an automated benchmark, because that is inherent to running unattended. We're stating it plainly so the setup is fully transparent. The flag is off by default; the interactive app always asks.
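
The behavior described above amounts to an approval gate with a one-time operator override. The Python sketch below is only an illustration of that idea, not Atlarix's actual implementation: the environment variable name comes from the text, while the function names, the `ask_user` callback, and the exact-match check on "1" are assumptions.

```python
import os


def is_autonomous(env=None):
    """True only when the operator has explicitly set the flag to "1"."""
    env = os.environ if env is None else env
    return env.get("ATLARIX_AUTONOMOUS_DANGER") == "1"


def approve(action, ask_user, env=None):
    """Grant approval up front in autonomous mode; otherwise defer to the user."""
    if is_autonomous(env):
        return True
    return ask_user(action)


def deny_everything(action):
    """Stand-in for an unattended run where nobody is there to approve."""
    return False


# Default: flag unset, so the gate defers to the (absent) user and blocks.
blocked = approve("pip install requests", deny_everything, env={})
# Benchmark run: flag set once by the operator, so the action goes through.
allowed = approve("pip install requests", deny_everything,
                  env={"ATLARIX_AUTONOMOUS_DANGER": "1"})
print(blocked, allowed)
```

Keeping the default at "ask" and requiring an explicit value to bypass it means an unset or mistyped flag fails safe.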

    Where we are, and what's next

    • Official leaderboard: a full 5-attempt (k=5) run for a verified Terminal-Bench submission — the goal we're working toward.
    • More open models: the same head-to-head on DeepSeek, Kimi, and Qwen, so any claim holds across the open-weight frontier, not one model.
    • More benchmarks: SWE-bench and beyond, so the picture never rests on a single test.

    Full write-up: benchmark blog post.