December 30, 2025

Anatomy of a Test Case: How SN121 Evaluates Agent Responses

By Taylor Sudermann
SN121 Dev Log

Every evaluation in SN121 starts with a test case. But what's actually inside one?

This post breaks down the structure of a test case — from the prompt an agent receives, to the ground truth and rubric, to how validators apply scoring and produce a final result.

Understanding this structure is key to understanding how SN121 makes agent performance observable, comparable, and reproducible.

The Structure

Each test case in the SN121 dataset contains everything needed to evaluate an agent's response:

Input — The prompt the agent receives, including the task description, any JSON schema it must follow, and the source context.

Ground Truth — The expected output and metadata describing how the answer should be evaluated.

Metadata — The evaluation contract: scoring categories with weights, explicit penalties, and acceptable variations.
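The three components above can be sketched as a single record. This is an illustrative shape, not SN121's exact schema; the field names are assumptions:

```python
# Hypothetical sketch of a test case record. Field names and nesting are
# illustrative, not the exact SN121 dataset schema.
test_case = {
    "input": {
        "prompt": "Extract action items from the email below as JSON.",
        "schema": {"type": "array"},            # JSON schema the agent must follow
        "context": "From: operations@company.com ...",
    },
    "ground_truth": [
        {"task": "compile the sales report for Q3", "assignee": "Sam", "due": "2025-10-12"},
    ],
    "metadata": {
        "rubric": {"task_completion": 0.4, "schema_adherence": 0.3,
                   "retrieval_accuracy": 0.2, "clarity": 0.1},
        "penalties": ["hallucinated_tasks", "extra_fields", "bad_date_format"],
        "acceptable_variations": ["slight rephrasing of task content"],
    },
}

# The rubric weights should cover the full score.
print(round(sum(test_case["metadata"]["rubric"].values()), 2))  # 1.0
```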

Let's look at a real example.

Example: Test 002

Domain: Operations
Skill: Extraction
Difficulty: Medium

The Input

This is what the agent receives:

From: operations@company.com
To: sam@company.com
Subject: End-of-quarter tasks

Hi Sam,

We need to wrap up a few things before the quarter ends. First, please compile the sales report for Q3 by October 12th. Second, coordinate with Finance to ensure all invoices are processed by the 15th. Also, let's schedule a meeting with the warehouse team next week (no specific date yet) to review inventory.

Thanks,
Operations Team

The agent is also given a JSON schema it must follow:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "task": { "type": "string" },
      "assignee": { "type": "string" },
      "due": { "type": ["string", "null"] }
    },
    "required": ["task", "assignee", "due"],
    "additionalProperties": false
  }
}

The task is clear: extract action items from the email into a structured JSON array. Each item needs a task, assignee, and due date (which may be null if no date is specified).
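As a sketch of what that schema enforces, here is a minimal stdlib-only check of one array item: required keys present, no extras, correct value types. This is not a full JSON Schema validator, just the constraints relevant to this test:

```python
def check_item(item):
    """Minimal check against this test's schema: the three required keys,
    no additional properties, string types, and a nullable "due" field.
    Not a general JSON Schema validator."""
    required = {"task", "assignee", "due"}
    if set(item) != required:  # enforces both "required" and "additionalProperties": false
        return False
    if not isinstance(item["task"], str) or not isinstance(item["assignee"], str):
        return False
    return item["due"] is None or isinstance(item["due"], str)

items = [
    {"task": "compile the sales report for Q3", "assignee": "Sam", "due": "2025-10-12"},
    {"task": "schedule meeting with warehouse team", "assignee": "Sam", "due": None},
]
print(all(check_item(i) for i in items))  # True
```

An item with an extra field (say, `"priority"`) fails the first check, which is exactly the "no extras" requirement.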

The Ground Truth

This is the expected output:

[
  {
    "task": "compile the sales report for Q3",
    "assignee": "Sam",
    "due": "2025-10-12"
  },
  {
    "task": "coordinate with Finance to process invoices",
    "assignee": "Sam",
    "due": "2025-10-15"
  },
  {
    "task": "schedule meeting with warehouse team",
    "assignee": "Sam",
    "due": null
  }
]

Notice the dates are in ISO format (2025-10-12), not the natural language format from the email ("October 12th"). This matters.
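The normalization the ground truth expects can be sketched with the standard library. The year is an assumption taken from the test's context; a real extractor would need to infer it:

```python
import re
from datetime import datetime

def to_iso(text, year=2025):
    """Normalize a natural-language date like 'October 12th' to ISO 8601.
    The year defaults to an assumed value; a production extractor would
    infer it from the email's context."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)  # drop ordinal suffix
    return datetime.strptime(f"{cleaned} {year}", "%B %d %Y").date().isoformat()

print(to_iso("October 12th"))  # 2025-10-12
```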

The Rubric

The rubric defines how to score the response. Each category has a weight:

Task Completion (0.4) — Were all tasks extracted from the email?

Schema Adherence (0.3) — Is the output valid JSON with correct fields and no extras?

Retrieval Accuracy (0.2) — Are the dates and assignee correct?

Clarity (0.1) — Is the task phrasing concise?

Task completion is weighted highest — missing a task is worse than slightly awkward phrasing.
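The weighted combination itself is simple arithmetic; this sketch uses the four weights above (the function name and score dict are illustrative):

```python
WEIGHTS = {"task_completion": 0.4, "schema_adherence": 0.3,
           "retrieval_accuracy": 0.2, "clarity": 0.1}

def weighted_score(category_scores):
    """Combine per-category scores (each in [0, 1]) using the rubric weights."""
    return sum(WEIGHTS[c] * s for c, s in category_scores.items())

# A response that is perfect everywhere except clarity loses only 0.1.
print(round(weighted_score({"task_completion": 1.0, "schema_adherence": 1.0,
                            "retrieval_accuracy": 1.0, "clarity": 0.0}), 2))  # 0.9
```

Because task completion carries 0.4 of the weight, missing a task costs four times as much as the same shortfall in clarity.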

Penalties

Penalties reduce scores for specific failure modes:

  • Hallucinating tasks that aren't in the email
  • Adding non-requested fields to the JSON
  • Incorrect ISO date formatting
  • Pluralizing or altering tasks beyond their meaning
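The mechanical failure modes in that list can be detected programmatically. This is a hypothetical penalty pass; the checks and the flat 0.1 deduction are illustrative, not SN121's actual rules:

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def penalty_deduction(items, source_text, penalty=0.1):
    """Hypothetical penalty pass: deduct a fixed amount per detected failure
    mode. The checks and penalty size are illustrative, not SN121's rules."""
    allowed = {"task", "assignee", "due"}
    deduction = 0.0
    for item in items:
        if set(item) - allowed:                               # non-requested JSON fields
            deduction += penalty
        if item.get("due") is not None and not ISO_DATE.match(item["due"]):
            deduction += penalty                              # incorrect ISO date formatting
        if item.get("task", "").lower() not in source_text.lower():
            deduction += penalty                              # possible hallucinated task
    return deduction

email = "please compile the sales report for Q3 by October 12th"
items = [{"task": "compile the sales report for Q3", "assignee": "Sam",
          "due": "October 12th"}]
print(penalty_deduction(items, email))  # 0.1 (only the bad date format is flagged)
```

The substring check for hallucinations is deliberately crude; in practice that judgment is made by the grading model, not string matching.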

Acceptable Variations

Not every difference from the ground truth is a mistake:

  • Slight rephrasing of task content is fine
  • "Warehouse team meeting" phrasing may vary
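One way to picture tolerance for rephrasing is a simple similarity ratio, sketched here with the standard library. The real rubric's variation rules are judged by the grading model, not string similarity, so the threshold is purely illustrative:

```python
from difflib import SequenceMatcher

def close_enough(submitted, expected, threshold=0.6):
    """Hypothetical tolerance check for rephrased task text using a character
    similarity ratio. Illustrative only; SN121's acceptable variations are
    assessed by the grading model, not by string matching."""
    ratio = SequenceMatcher(None, submitted.lower(), expected.lower()).ratio()
    return ratio >= threshold

# Two equally correct phrasings of the same warehouse-meeting task.
print(close_enough("schedule a warehouse team meeting",
                   "schedule meeting with warehouse team"))
```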

A Real Evaluation Result

Here's what actually happens when a validator scores an agent response to this test.

The agent understood the task. It extracted all three action items correctly. The task phrasing was concise and professional. But there was a problem: the agent output the dates as "October 12th" instead of "2025-10-12".

The grading model (DeepSeek-V3 via Letta Evals) returned:

Score: 0.75000

Rationale: "The submission correctly extracts all tasks and assignee information, demonstrating strong task completion and clarity. However, it fails to format the dates in ISO format (e.g., '2025-10-12' instead of 'October 12th'), which is a significant penalty in schema adherence and retrieval accuracy. The task phrasing is concise and professional, aligning well with the clarity criteria."

This is exactly what the rubric is designed to catch. The agent understood the task — but evaluation isn't just about understanding. It's about whether the output is safe for systems to consume. A downstream system expecting ISO dates would break on "October 12th".

That's why this scored 0.75 instead of 1.0.
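The grader's internal per-category numbers aren't published, but one hypothetical breakdown consistent with the 0.75 score looks like this (the category scores below are assumptions, not the actual grader output):

```python
# One hypothetical per-category breakdown consistent with the 0.75 result.
# The grading model's actual internal numbers are not published.
weights = {"task_completion": 0.4, "schema_adherence": 0.3,
           "retrieval_accuracy": 0.2, "clarity": 0.1}
scores  = {"task_completion": 1.0,   # all three tasks extracted
           "schema_adherence": 0.5,  # valid JSON shape, but dates violate the expected format
           "retrieval_accuracy": 0.5,  # dates present but not normalized to ISO
           "clarity": 1.0}           # concise, professional phrasing

total = sum(weights[c] * scores[c] for c in weights)
print(round(total, 2))  # 0.75
```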

Why This Structure Matters

Evaluation isn't just "right or wrong"

A binary pass/fail tells you nothing useful. The weighted rubric reveals where performance breaks down — was it task completion? Schema adherence? Retrieval accuracy? Clarity?

Penalties catch production failures

In production, an agent that produces three items when asked for two is broken — even if all three items are good. The penalty system enforces this.

Acceptable variations prevent false negatives

Language is flexible. Multiple phrasings can be equally correct. The rubric accounts for this without sacrificing precision.

Every score is traceable

When a validator returns 0.75000, you can trace it back: which categories scored low? Which penalties were applied? What was the rationale?

The Evaluation Contract

Think of each test case as a contract:

  • The input defines what the agent must do
  • The ground truth defines what success looks like
  • The rubric defines how success is measured
  • The penalties define what failure looks like
  • The acceptable variations define the boundaries of correctness

This contract is explicit. Nothing is hidden. When an agent is evaluated, everyone — the developer, the validator, the network — can see exactly how the score was produced.

Explore the Dataset

The full dataset and rubric from the Preview Challenge are available to download. You can see every test case, every rubric, every penalty.

Explore the Preview Challenge →