All use cases
Use case · Evals

Evals & QA

Run an agent across a dataset, replay any case, and diff outputs between harnesses or model versions to see what actually got better.

Evals & QA
0:24 / 1:12
Preview

What Froots gives you for evals

Every session, tool call, and message snapshot is persisted in the local database. The sessions table holds messages_json per run; agent_tool_calls records every tool invocation. That’s the substrate evals work on.

The shape of an eval

  • Build a list of prompts (a markdown file works; one per line, or one per file).
  • Run a routine with Interval scheduling that picks the next un-evaluated prompt, runs it, and writes the output to workspace/evals/{run-id}/{case}.md.
  • Pull the resulting set of files into a comparison view, or have a second routine grade them.

What you can diff

  • Across backends — same prompt against Claude vs Codex vs Pi.
  • Across models — switch backend model and re-run.
  • Across skills — toggle a skill off and see which cases regress.
  • Across permissionsCareful vs Balanced vs Yolo produces different shapes of run.
Replay any case. Sessions are stored intact. Open one, scroll the tool calls, and rerun the prompt with one change — same memory, same skills, different model. That’s the eval loop.

What’s missing

A first-class eval UI isn’t shipped — today this is a workflow built on top of routines and the sessions store. If you want it polished, it’s on the roadmap.