Eval harness
The Harbor harness runs Roder against Terminal-Bench tasks with a prebuilt Linux binary, isolated config/auth directories, JSON event capture, and structured run summaries. It is a first-class consumer of the runtime, not a separate benchmark-only agent. Current pass/fail results live on the eval scoreboard.
Run shape
- Harbor launches roder exec --json --profile eval inside each task container.
- The adapter records roder-events.jsonl, final assistant text, stderr diagnostics, and roder-run-summary.json.
- Smoke and full-run configs preserve Docker task images by default so offline image preflight and targeted reruns can reuse them.
- Analyzer scripts separate harness/setup/provider failures from ordinary reward-0 scored failures.
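The split the analyzer scripts make can be illustrated with a minimal Python sketch. The field names error_kind and reward, and the category labels, are assumptions made for this example; the actual schema of roder-run-summary.json is not described on this page.

```python
from collections import Counter

# Hypothetical category labels; the real analyzer's labels are not documented here.
INFRA_KINDS = {"harness", "setup", "provider"}


def classify(summary: dict) -> str:
    """Return 'infra:<kind>', 'pass', or 'scored_failure' for one run summary."""
    kind = summary.get("error_kind")  # assumed field name
    if kind in INFRA_KINDS:
        # Infrastructure problems are reported separately and never scored.
        return f"infra:{kind}"
    reward = summary.get("reward") or 0  # assumed field name; missing means 0
    return "pass" if reward > 0 else "scored_failure"


def tally(summaries) -> Counter:
    """Count how many runs fall into each category."""
    return Counter(classify(s) for s in summaries)
```

Keeping infrastructure failures out of the scored bucket means a provider outage cannot be mistaken for a model regression when two runs are compared.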
Deadline finalization
Eval runs can set a turn deadline. Roder reserves a finalization window, prompts the model to stop opening new work, and disables tools for the final answer pass. When a task ledger is required, the runtime asks for scoreable output checkpoints and ledger completion before finalizing.
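One way to picture the reserved finalization window is as a three-phase clock. The sketch below is illustrative only: the class name, field names, and phase labels are hypothetical, and the task-ledger checks are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnDeadline:
    """Hypothetical model of a turn deadline with a reserved final window."""

    deadline_s: float      # total time allowed for the turn, in seconds
    finalization_s: float  # tail of the turn reserved for the final answer

    def phase(self, elapsed_s: float) -> str:
        """Return 'working', 'finalizing', or 'expired' for the elapsed time."""
        if elapsed_s >= self.deadline_s:
            return "expired"
        if elapsed_s >= self.deadline_s - self.finalization_s:
            return "finalizing"
        return "working"

    def tools_enabled(self, elapsed_s: float) -> bool:
        """Tools are only available before the finalization window opens."""
        return self.phase(elapsed_s) == "working"
```

With TurnDeadline(600, 60), a check at 550 seconds falls in the finalizing phase, so tools_enabled(550) is False and the model is expected to produce its final answer from what it already has.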
Command behavior
exec_command and shell use the turn deadline to cap effective
process timeouts. Command output is formatted consistently, UTF-8 truncation is safe,
and timeout metadata is returned with the tool result so eval analysis can classify
the failure.
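The two mechanisms in this section, deadline-capped timeouts and UTF-8-safe truncation, can be sketched in a few lines of Python. This is an illustration under assumptions, not Roder's implementation: the function names and the metadata keys are hypothetical.

```python
def effective_timeout(requested_s: float, remaining_s: float) -> float:
    """Cap a requested process timeout by the time left before the deadline."""
    return max(0.0, min(requested_s, remaining_s))


def timeout_metadata(requested_s: float, remaining_s: float) -> dict:
    """Build hypothetical metadata an analyzer could use to classify a timeout."""
    effective = effective_timeout(requested_s, remaining_s)
    return {
        "requested_s": requested_s,
        "effective_s": effective,
        "capped_by_deadline": effective < requested_s,
    }


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops the partial multi-byte sequence left at the cut.
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

For example, truncate_utf8("héllo", 2) returns "h" rather than a string ending in half of the two-byte "é", and timeout_metadata(120, 30) reports an effective timeout of 30 seconds with capped_by_deadline set to True.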
Plan-first reruns
The Harbor adapter can run a planning turn first, store roder-plan.md and
its event/stderr artifacts, then resume the same thread for the implementation turn.
This mode is targeted at tasks where planning, artifact hygiene, or policy framing is
likely to matter; it is not the default for every full run because it adds wall time.
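The two-turn flow can be sketched as follows. The run_turn callable is a hypothetical stand-in for however the adapter invokes Roder, and the prompts are invented for illustration; only the roder-plan.md file name comes from the description above.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical signature: run_turn(prompt, thread_id) -> (thread_id, final_text).
RunTurn = Callable[[str, Optional[str]], Tuple[str, str]]


def plan_first(run_turn: RunTurn, task: str, out_dir: Path) -> str:
    """Run a planning turn, save the plan, then implement on the same thread."""
    # Turn 1: start a new thread (thread_id=None) and ask for a plan only.
    thread_id, plan = run_turn(f"Write a plan for this task:\n{task}", None)
    (out_dir / "roder-plan.md").write_text(plan, encoding="utf-8")

    # Turn 2: resume the same thread so the plan stays in context.
    _, answer = run_turn("Implement the plan.", thread_id)
    return answer
```

Because both turns run back to back, the extra wall time is roughly the cost of the planning turn, which is why this mode suits targeted reruns rather than every full run.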
Current benchmark signal
The latest documented strict-medium Terminal-Bench run was Harbor-clean across 89 trials and improved from 47 to 50 passes after deadline and reliability work. The plan-first xhigh rerun converted four previously failing tasks in a 28-task subset.