Initial full GPT-5.5 medium run
full-suite
- First clean full-suite score recorded in the current public dashboard.
- Established the baseline for deadline, timeout, and score-improvement work.
This page tracks the benchmark runs we use to measure whether Roder is becoming more reliable and more capable. Full-suite scores are shown separately from targeted reruns and provider-validation subsets so the trend stays honest. These are development trajectory results, not final benchmark submissions.
If the latest in-development full-suite result were compared directly against the current Terminal-Bench 2.0 leaderboard, its 56.2% pass rate would theoretically place around #56 of 143 entries.
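One way to read "theoretically place" is as a simple count: the rank is one plus the number of leaderboard entries with a strictly higher pass rate. The sketch below assumes that reading. The function name and the sample scores are hypothetical and are not taken from the actual leaderboard.

```typescript
// Hypothetical helper: rank a score against a list of leaderboard pass rates.
// Rank = 1 + number of entries with a strictly higher score, so ties share
// the better position. This is an assumed convention, not the dashboard's.
function theoreticalRank(scorePct: number, leaderboardPct: number[]): number {
  const strictlyHigher = leaderboardPct.filter((s) => s > scorePct).length;
  return strictlyHigher + 1;
}

// Made-up leaderboard values, for illustration only.
const sampleLeaderboard: number[] = [80, 70, 60, 50];
const sampleRank = theoreticalRank(56.2, sampleLeaderboard);
console.log(`Theoretical rank: #${sampleRank}`);
```

Because the comparison uses only the pass rate, it ignores anything else a real submission would need, which is why the placement is described as theoretical.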
These Roder evals are not a full submittable Terminal-Bench run. They are published to show the in-development trajectory of Roder. We will run and publish a full submittable benchmark separately when the harness is ready.
Leaderboard entries shown alongside the theoretical #56 position: Terminus 2 with Gemini 3 Pro and Letta Code with Gemini 3 Pro.
| Run | Score | Lift | Soft timeouts | Policy blocks | Clean |
|---|---|---|---|---|---|
| Initial full GPT-5.5 medium run | 48.3% | baseline | 21 total, 17 failed | 1 | clean |
| Strict medium baseline | 52.8% | +4 | 13 total, 11 failed | 5 | clean |
| Deadline and reliability full run | 56.2% | +3 | 11 total, 8 failed | 6 | clean |
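For readers who want the run-to-run change in percentage points, the sketch below derives it from the Score column. The unit of the Lift column is not stated on this page, and the point deltas computed here (4.5 and 3.4) are finer-grained than the +4 and +3 shown, so treat this as a separate calculation rather than a reproduction of that column. The interface and function names are hypothetical.

```typescript
// Hypothetical row shape; only the fields needed for the delta are included.
interface FullSuiteRow {
  run: string;
  scorePct: number; // pass rate in percent, copied from the Score column
}

const rows: FullSuiteRow[] = [
  { run: "Initial full GPT-5.5 medium run", scorePct: 48.3 },
  { run: "Strict medium baseline", scorePct: 52.8 },
  { run: "Deadline and reliability full run", scorePct: 56.2 },
];

// Percentage-point change relative to the previous row, rounded to one
// decimal place; the first row is the baseline and has no delta (null).
function pointDeltas(data: FullSuiteRow[]): (number | null)[] {
  return data.map((row, i) =>
    i === 0
      ? null
      : Math.round((row.scorePct - data[i - 1].scorePct) * 10) / 10
  );
}

const deltas = pointDeltas(rows);
console.log(deltas);
```

Rounding to one decimal place avoids floating-point noise in the subtraction while matching the precision of the Score column.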
Each run is recorded in src/data/evalResults.ts with the suite, model, pass count, clean-run
status, notable failure signals, and the report path. Full-suite runs appear in the
trend chart; targeted and smoke runs stay in the history without being blended into
the headline score.
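A minimal sketch of how such a record and the chart filtering could look is shown below. The actual contents of src/data/evalResults.ts are not reproduced on this page, so every type, field name, function name, and sample value here is an assumption made for illustration.

```typescript
// Hypothetical record shape covering the fields listed above.
type RunKind = "full-suite" | "targeted" | "smoke";

interface EvalResult {
  name: string;
  kind: RunKind;
  suite: string;
  model: string;
  passed: number; // pass count
  total: number; // tasks attempted
  clean: boolean; // clean-run status
  failureSignals: string[]; // notable failure signals
  reportPath: string;
}

// Placeholder records; these are not real Roder results.
const results: EvalResult[] = [
  { name: "example-full-a", kind: "full-suite", suite: "example-suite", model: "example-model",
    passed: 4, total: 10, clean: true, failureSignals: [], reportPath: "reports/example-full-a.md" },
  { name: "example-targeted", kind: "targeted", suite: "example-suite", model: "example-model",
    passed: 2, total: 3, clean: true, failureSignals: ["timeout"], reportPath: "reports/example-targeted.md" },
  { name: "example-full-b", kind: "full-suite", suite: "example-suite", model: "example-model",
    passed: 6, total: 10, clean: true, failureSignals: [], reportPath: "reports/example-full-b.md" },
];

// Only full-suite runs feed the trend chart; the input array is left intact,
// so targeted and smoke runs remain available for the history view.
function trendPoints(all: EvalResult[]): { name: string; passRatePct: number }[] {
  return all
    .filter((r) => r.kind === "full-suite")
    .map((r) => ({ name: r.name, passRatePct: (100 * r.passed) / r.total }));
}

// Headline score = pass rate of the most recent full-suite run, if any.
function headlineScore(all: EvalResult[]): number | null {
  const points = trendPoints(all);
  return points.length > 0 ? points[points.length - 1].passRatePct : null;
}

const trend = trendPoints(results);
const headline = headlineScore(results);
console.log(trend, headline);
```

Keeping the run kind as an explicit field makes the separation a filter at read time, so a targeted rerun can never shift the headline number by accident.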