Evals & QA

Run an agent across a dataset, replay any case, and diff outputs between harnesses or model versions to see what actually got better.

What Froots gives you for evals

Every session, tool call, and message snapshot is persisted in the local database. The sessions table holds messages_json per run; agent_tool_calls records every tool invocation. That’s the substrate evals work on.
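As a rough illustration of reading that substrate, here is a minimal Python sketch. It assumes the local database is SQLite and that both tables can be filtered by a session identifier; the `id` and `session_id` column names and both helper functions are hypothetical. Only the `sessions` and `agent_tool_calls` tables and the `messages_json` column come from the description above.

```python
import json
import sqlite3


def load_session_messages(conn: sqlite3.Connection, session_id: str) -> list:
    """Return the decoded message snapshot for one session, or [] if absent."""
    # Assumption: sessions has an `id` column; only messages_json is documented.
    row = conn.execute(
        "SELECT messages_json FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    if row is None or row[0] is None:
        return []
    return json.loads(row[0])


def count_tool_calls(conn: sqlite3.Connection, session_id: str) -> int:
    """Count the tool invocations recorded for one session."""
    # Assumption: agent_tool_calls links back to a session via `session_id`.
    row = conn.execute(
        "SELECT COUNT(*) FROM agent_tool_calls WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return row[0]
```

Check the real schema before relying on these column names; the queries are meant to show the kind of read an eval needs, not the actual layout.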

The shape of an eval

1. Build a list of prompts (a markdown file works; one per line, or one per file).
2. Run a routine with Interval scheduling that picks the next unevaluated prompt, runs it, and writes the output to workspace/evals/{run-id}/{case}.md.
3. Pull the resulting set of files into a comparison view, or have a second routine grade them.
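The bookkeeping behind these steps can be sketched in Python. This is an illustration under stated assumptions, not how a Froots routine is implemented: it assumes a one-prompt-per-line file, treats a case as evaluated once its output file exists, and names cases `case-001`, `case-002`, and so on. All function names and the case naming scheme are hypothetical; only the workspace/evals/{run-id}/{case}.md layout comes from the steps above.

```python
from __future__ import annotations

from pathlib import Path


def load_cases(prompts_file: Path) -> list[str]:
    """Read one prompt per line, skipping blank lines."""
    text = prompts_file.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def case_path(workspace: Path, run_id: str, index: int) -> Path:
    """Output location for a case: workspace/evals/{run-id}/{case}.md."""
    return workspace / "evals" / run_id / f"case-{index:03d}.md"


def next_unevaluated(
    prompts_file: Path, workspace: Path, run_id: str
) -> tuple[int, str] | None:
    """Return (index, prompt) for the first case with no output file yet."""
    for index, prompt in enumerate(load_cases(prompts_file), start=1):
        if not case_path(workspace, run_id, index).exists():
            return index, prompt
    return None  # every case in this run already has an output


def record_output(workspace: Path, run_id: str, index: int, output: str) -> Path:
    """Write one case's output, creating the run directory if needed."""
    path = case_path(workspace, run_id, index)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output, encoding="utf-8")
    return path
```

Using the output file itself as the "done" marker keeps each scheduled tick stateless: the routine only has to ask for the next case, run it, and record the result.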

What you can diff

Across backends: the same prompt against Claude vs Codex vs Pi.
Across models: switch the backend model and re-run.
Across skills: toggle a skill off and see which cases regress.
Across permissions: Careful, Balanced, and Yolo produce differently shaped runs.
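Whichever axis you vary, the comparison comes down to diffing two run directories case by case. A minimal Python sketch using the standard library's `difflib` follows; it assumes both runs wrote their outputs as `.md` files with matching case names under the same evals directory. The `diff_runs` helper is hypothetical, not part of Froots.

```python
from __future__ import annotations

import difflib
from pathlib import Path


def _read_lines(path: Path) -> list[str]:
    """Return a file's lines, or [] when the case is missing from that run."""
    return path.read_text(encoding="utf-8").splitlines() if path.exists() else []


def diff_runs(evals_dir: Path, run_a: str, run_b: str) -> dict[str, str]:
    """Map each case file name to a unified diff, for cases whose output changed."""
    dir_a, dir_b = evals_dir / run_a, evals_dir / run_b
    # Compare the union of case names so cases present in only one run show up.
    names = sorted(
        {p.name for p in dir_a.glob("*.md")} | {p.name for p in dir_b.glob("*.md")}
    )
    diffs: dict[str, str] = {}
    for name in names:
        delta = list(
            difflib.unified_diff(
                _read_lines(dir_a / name),
                _read_lines(dir_b / name),
                fromfile=f"{run_a}/{name}",
                tofile=f"{run_b}/{name}",
                lineterm="",
            )
        )
        if delta:  # identical outputs produce no diff lines
            diffs[name] = "\n".join(delta)
    return diffs
```

A text diff only shows that an output changed, not whether it improved, so a grading routine or a human review still has to judge the changed cases.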

Replay any case. Sessions are stored intact. Open one, scroll the tool calls, and rerun the prompt with one change — same memory, same skills, different model. That’s the eval loop.
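The "one change" discipline is easy to enforce in a script. The Python sketch below is purely illustrative: `RunConfig`, its fields, and the placeholder values `model-a` and `model-b` are hypothetical and do not reflect how Froots represents a run. The idea is to derive the variant from the baseline and confirm that exactly one setting differs before comparing outputs.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of the settings a replay holds fixed or varies."""

    backend: str
    model: str
    skills: tuple[str, ...]
    permissions: str


def changed_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """List the names of settings that differ between two configurations."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = RunConfig(
    backend="Claude", model="model-a", skills=("search",), permissions="Balanced"
)
# Derive the variant from the baseline so everything else stays identical.
variant = replace(baseline, model="model-b")

# Guard the comparison: a replay that changes two things at once is not a clean test.
assert changed_fields(baseline, variant) == ["model"]
```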

What’s missing

A first-class eval UI isn’t shipped yet. Today this is a workflow built on top of routines and the sessions store; a polished version is on the roadmap.
