All use cases
Use case · Evals
Evals & QA
Run an agent across a dataset, replay any case, and diff outputs between harnesses or model versions to see what actually got better.
Evals & QA
0:24 / 1:12
Preview
What Froots gives you for evals
Every session, tool call, and message snapshot is persisted in the local database. The sessions table holds messages_json per run; agent_tool_calls records every tool invocation. That’s the substrate evals work on.
The shape of an eval
- Build a list of prompts (a markdown file works; one per line, or one per file).
- Run a routine with
Intervalscheduling that picks the next un-evaluated prompt, runs it, and writes the output toworkspace/evals/{run-id}/{case}.md. - Pull the resulting set of files into a comparison view, or have a second routine grade them.
What you can diff
- Across backends — same prompt against Claude vs Codex vs Pi.
- Across models — switch backend model and re-run.
- Across skills — toggle a skill off and see which cases regress.
- Across permissions —
CarefulvsBalancedvsYoloproduces different shapes of run.
Replay any case. Sessions are stored intact. Open one, scroll the tool calls, and rerun the prompt with one change — same memory, same skills, different model. That’s the eval loop.
What’s missing
A first-class eval UI isn’t shipped — today this is a workflow built on top of routines and the sessions store. If you want it polished, it’s on the roadmap.