Getting started

Eval harness

The Harbor harness runs Roder against Terminal-Bench tasks with a prebuilt Linux binary, isolated config/auth directories, JSON event capture, and structured run summaries. It is a first-class consumer of the runtime, not a separate benchmark-only agent. Current pass/fail results live on the eval scoreboard.

Run shape

Harbor launches roder exec --json --profile eval inside each task container.
The adapter records roder-events.jsonl, final assistant text, stderr diagnostics, and roder-run-summary.json.
Smoke and full-run configs preserve Docker task images by default so offline image preflight and targeted reruns can reuse them.
Minimal parity configs can pin [tools].allowlist to the native coding surface, including unified_exec, freeform apply_patch, and view_image.
Analyzer scripts separate harness/setup/provider failures from ordinary reward-0 scored failures.

Deadline finalization

Eval runs can set a turn deadline. Roder reserves a finalization window, prompts the model to stop opening new work, and disables tools for the final answer pass. When a task ledger is required, the runtime asks for scoreable output checkpoints and ledger completion before finalizing.

Command behavior

exec_command and shell use the turn deadline to cap effective process timeouts. Command output is formatted consistently, UTF-8 truncation is safe, and timeout metadata is returned with the tool result so eval analysis can classify the failure.

The eval profile also enables loop-persistence defaults that keep a task moving after recoverable tool-failure loops or empty no-tool finalization. The safeguards remain bounded by per-turn tool-failure, model-call, and tool-round caps.

Plan-first reruns

The Harbor adapter can run a planning turn first, store roder-plan.md and its event/stderr artifacts, then resume the same thread for the implementation turn. This mode is targeted at tasks where planning, artifact hygiene, or policy framing is likely to matter; it is not the default for every full run because it adds wall time.

Current benchmark signal

The latest documented full run is a non-submittable Terminal-Bench 2.1 local development pass with codex/gpt-5.5 at xhigh reasoning. Harbor reported 53 reward-1 tasks and 36 reward-0 scored failures across all 89 tasks, with 0 Harbor exceptions. The local analyzer marked the run clean with no harness error classes. It used an access-token-only auth file and an agent timeout multiplier to keep Roder finalization inside Harbor's outer timeout, so it is trajectory evidence rather than a leaderboard-valid submission.

The newest targeted parity slice covers four tasks that Codex passed while the previous minimal Roder setup failed. With the native Codex-parity tool surface and eval-loop persistence enabled, all four scored reward 1.0; two still reported Harbor AgentTimeoutError exceptions, so the result is published as targeted trajectory evidence rather than a clean full-suite result.