Eval harness

The Harbor harness runs Roder against Terminal-Bench tasks with a prebuilt Linux binary, isolated config/auth directories, JSON event capture, and structured run summaries. It is a first-class consumer of the runtime, not a separate benchmark-only agent. Current pass/fail results live on the eval scoreboard.

Run shape

  • Harbor launches roder exec --json --profile eval inside each task container.
  • The adapter records roder-events.jsonl, final assistant text, stderr diagnostics, and roder-run-summary.json.
  • Smoke and full-run configs preserve Docker task images by default so offline image preflight and targeted reruns can reuse them.
  • Analyzer scripts separate harness/setup/provider failures from ordinary reward-0 scored failures.
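The split described in the last bullet can be sketched as a small classifier over per-trial summaries. This is an illustrative sketch only: the field names error_kind and reward, the category labels, and the helper names are assumptions made for the example, not the documented schema of roder-run-summary.json or the real analyzer scripts.

```python
from collections import Counter
from typing import Iterable

# Hypothetical labels for failures that are not the agent's fault.
INFRA_KINDS = {"harness", "setup", "provider"}


def classify_trial(summary: dict) -> str:
    """Bucket one trial summary (field names are assumed, not documented)."""
    error_kind = summary.get("error_kind")  # assumed field
    if error_kind in INFRA_KINDS:
        # Infrastructure problems are reported separately so they do not
        # count as scored failures.
        return f"{error_kind}_failure"
    reward = summary.get("reward")  # assumed field
    if reward is None:
        return "unknown"
    return "pass" if reward > 0 else "scored_failure"


def tally(summaries: Iterable[dict]) -> Counter:
    """Count trials per bucket."""
    return Counter(classify_trial(s) for s in summaries)
```

Keeping infrastructure buckets apart from scored_failure means a provider outage changes the rerun list rather than the pass rate.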

Deadline finalization

Eval runs can set a turn deadline. Roder reserves a finalization window before that deadline, prompts the model to stop starting new work, and disables tools for the final answer pass. When a task ledger is required, the runtime asks for scoreable output checkpoints and ledger completion before finalizing.
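One way to picture the reserved window is as a phase function over elapsed time. The sketch below is a simplified model under stated assumptions: the class name TurnBudget, the phase names, and the idea that the window is a fixed number of seconds are all hypothetical, and Roder's actual scheduling may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnBudget:
    """Hypothetical model of a turn deadline with a reserved final window."""

    deadline_s: float             # total wall time allowed for the turn
    finalization_window_s: float  # time reserved for the final answer pass

    def phase(self, elapsed_s: float) -> str:
        # New work must stop once the reserved window begins.
        work_cutoff = max(0.0, self.deadline_s - self.finalization_window_s)
        if elapsed_s >= self.deadline_s:
            return "expired"
        if elapsed_s >= work_cutoff:
            return "finalizing"
        return "working"

    def tools_enabled(self, elapsed_s: float) -> bool:
        # Tools are only available before the finalization window opens.
        return self.phase(elapsed_s) == "working"
```

With a 600-second deadline and a 60-second window, this model allows tool use until second 540 and leaves the last minute for the final answer.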

Command behavior

exec_command and shell use the turn deadline to cap effective process timeouts. Command output is formatted consistently, truncation never splits a multi-byte UTF-8 character, and timeout metadata is returned with the tool result so eval analysis can classify timeout failures.
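The two mechanisms in this paragraph, deadline-capped timeouts and UTF-8-safe truncation, can be illustrated with a minimal sketch. The function names, the reserve_s parameter, and the returned tuple shape are assumptions made for the example; they are not Roder's actual API or metadata format.

```python
from typing import Tuple


def effective_timeout(
    requested_s: float, remaining_s: float, reserve_s: float = 0.0
) -> Tuple[float, bool]:
    """Cap a requested process timeout by the time left before the deadline.

    Returns (timeout, capped); the flag is the kind of metadata a tool
    result could carry so analysis can tell a capped run from a normal one.
    """
    budget = max(0.0, remaining_s - reserve_s)
    capped = requested_s > budget
    return (min(requested_s, budget), capped)


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    if max_bytes <= 0:
        return ""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # The input was valid UTF-8, so the only invalid bytes after slicing
    # are an incomplete trailing sequence; errors="ignore" drops it.
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

For example, a command that asks for 120 seconds with 45 seconds left before the deadline would run with a 45-second timeout and be flagged as capped.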

Plan-first reruns

The Harbor adapter can run a planning turn first, store roder-plan.md and its event/stderr artifacts, then resume the same thread for the implementation turn. This mode targets tasks where planning, artifact hygiene, or policy framing is likely to matter. It is not the default for full runs because it adds wall time.

Current benchmark signal

The latest documented strict-medium Terminal-Bench run was Harbor-clean across 89 trials and improved from 47 to 50 passes after deadline and reliability work. The plan-first xhigh rerun converted four previously failing tasks in a 28-task subset.