Harbor / Terminal-Bench results

Roder evals over time

This page tracks the benchmark runs we use to measure whether Roder is becoming more reliable and more capable. Full-suite scores are shown separately from targeted reruns and provider-validation subsets so the trend stays honest. These are development trajectory results, not final benchmark submissions.

Last updated 26 May, 20:19
+7 full-suite passes since baseline
10 fewer soft timeouts vs first run
0 Harbor errors in latest full run
56.2% latest full-suite pass rate

Theoretical leaderboard context

If the latest in-development full-suite result were compared directly against the current Terminal-Bench 2.0 leaderboard, its 56.2% pass rate would theoretically place around #56 of 143 entries.

These Roder evals are not a full submittable Terminal-Bench run. They are published to show the in-development trajectory of Roder. We will run and publish a full submittable benchmark separately when the harness is ready.

#55 · Terminus 2 with Gemini 3 Pro · 56.9%
Theoretical #56 · Roder dev trajectory · 56.2%
#56 · Letta Code with Gemini 3 Pro · 56%
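The theoretical placement can be reproduced by counting how many leaderboard entries score strictly higher than the run. The sketch below is illustrative only: `theoreticalRank` is a hypothetical helper, not part of Roder or Harbor, and the sample array holds just the two neighbouring scores quoted on this page rather than the full leaderboard.

```typescript
// Hypothetical helper: rank = 1 + number of entries with a strictly higher score.
function theoreticalRank(score: number, leaderboardScores: number[]): number {
  const higher = leaderboardScores.filter((s) => s > score).length;
  return higher + 1;
}

// Only the two neighbouring entries shown on this page (not the full leaderboard).
const neighbourScores: number[] = [56.9, 56.0];

// Among these two entries alone, 56.2% sits one place below 56.9%.
const rankAmongNeighbours = theoreticalRank(56.2, neighbourScores);
console.log(rankAmongNeighbours);
```

Applied to the full leaderboard, the same count gives 55 entries above 56.2% (everything down to the current #55 at 56.9%), hence the theoretical #56.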

Full-suite trend

48.3% · 24 May, 19:16
52.8% · 24 May, 23:46
56.2% · 25 May, 03:05
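The trend percentages follow directly from the pass counts in the run history (43, 47, and 50 passes over 89 trials). A minimal sketch of that arithmetic is shown here; the `FullSuiteRun` shape and the helper names are assumptions made for illustration, not the dashboard's actual code.

```typescript
// Hypothetical shape for one full-suite run; field names are illustrative.
interface FullSuiteRun {
  label: string;
  passes: number;
  trials: number;
}

// Pass rate as a percentage, rounded to one decimal place.
function passRate(run: FullSuiteRun): number {
  return Math.round((run.passes / run.trials) * 1000) / 10;
}

// Pass and trial counts taken from the run history on this page.
const fullSuiteRuns: FullSuiteRun[] = [
  { label: "Initial full GPT-5.5 medium run", passes: 43, trials: 89 },
  { label: "Strict medium baseline", passes: 47, trials: 89 },
  { label: "Deadline and reliability full run", passes: 50, trials: 89 },
];

// Percentages for the trend chart.
const rates: number[] = fullSuiteRuns.map(passRate);

// Lift of each run over the previous one, in passing tasks.
const stepLifts: number[] = fullSuiteRuns
  .slice(1)
  .map((run, i) => run.passes - fullSuiteRuns[i].passes);

// Total lift of the latest run over the first run.
const liftSinceBaseline: number =
  fullSuiteRuns[fullSuiteRuns.length - 1].passes - fullSuiteRuns[0].passes;

console.log(rates, stepLifts, liftSinceBaseline);
```

The per-run lifts (+4, then +3) sum to the +7 shown in the headline tiles.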

Run history

Initial full GPT-5.5 medium run

full-suite
48.3%
43 passing tasks · 89 trials · codex/gpt-5.5 · medium · 24 May, 19:16
  • First clean full-suite score recorded in the current public dashboard.
  • Established the baseline for deadline, timeout, and score-improvement work.

Strict medium baseline

full-suite
52.8%
47 passing tasks · 89 trials · codex/gpt-5.5 · medium · 24 May, 23:46
  • Disabled speed-policy drift and removed disqualifying Docker task resource overrides.
  • Improved on the initial run by four passes (43 to 47) while keeping Harbor errors at zero.

Deadline and reliability full run

full-suite
56.2%
50 passing tasks · 89 trials · codex/gpt-5.5 · medium · 25 May, 03:05
  • Added deadline-aware command execution and graceful finalization paths.
  • Reached 50/89 on the full suite, a seven-pass lift from the initial baseline.

Remaining-failures xhigh rerun

targeted
20%
7 passing tasks · 35 trials · codex/gpt-5.5 · xhigh · 25 May, 13:58
  • Measured whether higher reasoning alone converts the remaining full-suite failures.
  • Produced seven conversions to inform later campaign planning.

Plan-first targeted rerun

targeted
14.3%
4 passing tasks · 28 trials · codex/gpt-5.5 · medium plan, xhigh implementation · 25 May, 17:29
  • Ran a planning turn, stored plan artifacts, then resumed the same thread for implementation.
  • Converted git-leak-recovery, model-extraction-relu-logits, polyglot-rust-c, and regex-chess.

Plan-first smoke

smoke
100%
1 passing task · 1 trial · codex/gpt-5.5 · medium plan, xhigh implementation · 25 May, 17:06
  • Validated plan-first mechanics on polyglot-rust-c before the broader targeted rerun.

Gemini 3.5 Flash validation

targeted
83.3%
5 passing tasks · 6 trials · gemini/gemini-3.5-flash · default · 26 May, 18:47
  • Provider validation subset for the Gemini path; not directly comparable to full-suite GPT-5.5 runs.

Full-suite detail

Run · Score · Lift · Soft timeouts · Policy blocks · Clean

Initial full GPT-5.5 medium run
evals/reports/harbor/roder-tbench-full-gpt55-medium-analysis.json
48.3% · baseline · 21 total, 17 failed · 1 · clean

Strict medium baseline
evals/reports/harbor/roder-tbench-full-gpt55-medium-strict-analysis.json
52.8% · +4 · 13 total, 11 failed · 5 · clean

Deadline and reliability full run
evals/reports/harbor/roder-tbench-full-gpt55-medium-deadline-reliability-analysis.json
56.2% · +3 · 11 total, 8 failed · 6 · clean

How to keep this current. Add each new Harbor analysis artifact to src/data/evalResults.ts with the suite, model, pass count, clean-run status, notable failure signals, and the report path. Full-suite runs appear in the trend chart; targeted and smoke runs stay in the history without being blended into the headline score.
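
As a rough illustration of what an entry in src/data/evalResults.ts might look like, the sketch below records the fields listed above and keeps full-suite runs separate from targeted and smoke runs. The `EvalResult` type, its field names, and the `trendRuns` and `headlineScore` helpers are all assumptions; the real file may be organised differently. `clean` and `reportPath` are optional here only because this page does not list them for targeted runs.

```typescript
// Hypothetical types; the real src/data/evalResults.ts may differ.
type RunKind = "full-suite" | "targeted" | "smoke";

interface EvalResult {
  title: string;
  kind: RunKind;
  model: string;
  reasoning: string;
  passes: number;
  trials: number;
  clean?: boolean;       // clean-run status, where known
  notes: string[];       // notable failure signals or changes
  reportPath?: string;   // Harbor analysis artifact, where known
}

// Two entries using values shown on this page.
const evalResults: EvalResult[] = [
  {
    title: "Deadline and reliability full run",
    kind: "full-suite",
    model: "codex/gpt-5.5",
    reasoning: "medium",
    passes: 50,
    trials: 89,
    clean: true,
    notes: ["11 soft timeouts, 8 failed"],
    reportPath:
      "evals/reports/harbor/roder-tbench-full-gpt55-medium-deadline-reliability-analysis.json",
  },
  {
    title: "Remaining-failures xhigh rerun",
    kind: "targeted",
    model: "codex/gpt-5.5",
    reasoning: "xhigh",
    passes: 7,
    trials: 35,
    notes: ["Targeted rerun of remaining full-suite failures"],
  },
];

// Only full-suite runs feed the trend chart.
function trendRuns(results: EvalResult[]): EvalResult[] {
  return results.filter((r) => r.kind === "full-suite");
}

// Headline score: pass rate of the last full-suite run in the list, if any.
function headlineScore(results: EvalResult[]): number | undefined {
  const full = trendRuns(results);
  if (full.length === 0) return undefined;
  const latest = full[full.length - 1];
  return Math.round((latest.passes / latest.trials) * 1000) / 10;
}

console.log(trendRuns(evalResults).length, headlineScore(evalResults));
```

Filtering by run kind before computing the headline keeps a 20% targeted rerun from being blended into the full-suite score.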