January 8, 2026

End-to-End Agent Evaluation: How SN121 Turns Submissions into Scores

By Taylor Sudermann
SN121 Dev Log

When an agent is submitted to an SN121 Challenge, what actually happens?

Most benchmarks treat evaluation as a black box — you submit, you wait, you get a score. SN121 is different. Every step of the pipeline is designed to be transparent, reproducible, and auditable.

This post breaks down the end-to-end flow: from agent submission to final score.

The Pipeline at a Glance

The evaluation pipeline has five stages:

1. Submission — Agent developer uploads an agent file
2. Tasking — The system creates an evaluation task
3. Validation — Online validators run the agent against the test suite
4. Scoring — Each validator applies the rubric and produces scores
5. Aggregation — Scores are combined and rewards are distributed

Each stage is observable. Nothing happens in a black box.
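The five stages above can be sketched as a simple ordered state machine. This is an illustrative model, not SN121's actual implementation; the stage names mirror the list above but the `next_stage` helper is hypothetical.

```python
from enum import Enum, auto
from typing import Optional

class Stage(Enum):
    """The five pipeline stages, in pipeline order (illustrative)."""
    SUBMISSION = auto()
    TASKING = auto()
    VALIDATION = auto()
    SCORING = auto()
    AGGREGATION = auto()

def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after Aggregation."""
    order = list(Stage)
    i = order.index(stage)
    return order[i + 1] if i + 1 < len(order) else None
```

Because every transition is explicit, each stage boundary is a natural place to emit the observable outputs described below.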

In December 2024, we published our first Show & Tell demonstrating the core evaluation pipeline running locally. This video shows the end-to-end flow in action — from a simulated agent submission through to validator scoring and aggregation.

Stage 1: Submission

Agent developers submit their agents by uploading an agent file to a Challenge. Each Challenge has specific requirements — the agent framework, expected capabilities, and submission format.

For the Preview Challenge, all agents were built using Letta, with models served via Chutes.

When you submit:

  • Your agent file is stored on Hippius
  • A submission record is created and linked to the Challenge
  • The system queues your agent for evaluation
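A submission record along these lines captures the traceability described here. The field names and the status values are assumptions for illustration; the real schema may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Submission:
    """Illustrative submission record (field names assumed)."""
    agent_file_cid: str   # reference to the agent file stored on Hippius
    challenge_id: str     # the Challenge this submission is linked to
    status: str = "queued"  # e.g. queued -> evaluating -> scored
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# Example: a new submission enters the queue with a timestamp attached.
sub = Submission(agent_file_cid="<agent-cid>", challenge_id="preview-1")
```

Keeping the Challenge link, status, and timestamp on one record is what makes "when was it submitted, to which Challenge, and where is it now" a single lookup.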

Every submission is traceable. You can see when it was submitted, which Challenge it belongs to, and its current status.

Post Image

Stage 2: Tasking

Once submitted, the system creates an evaluation task. This task contains:

  • The agent file to be evaluated
  • The test suite (dataset) for the Challenge
  • The rubric that defines how to score responses
  • Metadata about the Challenge requirements

The task is then enqueued for online validators. Multiple validators will independently evaluate your agent to ensure consistency and prevent gaming.
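The task contents listed above can be sketched as an immutable record handed to each validator. This is a minimal sketch; the field names and the plain-list queue are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EvaluationTask:
    """Sketch of the task each online validator receives (fields assumed)."""
    agent_file_cid: str  # the agent file to be evaluated
    dataset_id: str      # the test suite for the Challenge
    rubric_id: str       # the rubric that defines scoring
    challenge_id: str    # metadata about the Challenge requirements

queue: List[EvaluationTask] = []
task = EvaluationTask(
    agent_file_cid="<agent-cid>",
    dataset_id="preview-dataset",
    rubric_id="preview-rubric",
    challenge_id="preview-1",
)
queue.append(task)  # every validator pulls the same frozen task
```

Freezing the task matters: because each validator evaluates an identical, immutable task, differences in their scores reflect the validators, not drift in the inputs.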

Stage 3: Validation

This is where the actual evaluation happens.

Each validator:

  • Receives the evaluation task
  • Runs the agent against every test case in the dataset
  • Captures the agent's responses
  • Applies the rubric to score each response

Validators operate independently. They don't share results until after scoring is complete. This ensures that no single validator can influence the outcome.

The Agent Evaluation Test Suite (AETS) handles the mechanics:

  • Loading the agent
  • Feeding it test inputs
  • Capturing outputs
  • Managing timeouts and errors

If an agent crashes, times out, or produces malformed output, the validator records this as part of the evaluation. Failures are data, not just errors.
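The "failures are data" principle can be shown with a minimal harness loop. The `agent` callable interface and result shape are assumptions, and real timeout enforcement is omitted for brevity.

```python
def run_suite(agent, test_cases):
    """Run `agent` (any callable taking an input string) on every test
    case, recording crashes and errors as results rather than aborting."""
    results = []
    for case in test_cases:
        try:
            output = agent(case["input"])
            results.append({"id": case["id"], "output": output, "error": None})
        except Exception as exc:
            # A crash is captured as part of the evaluation, not discarded.
            results.append({"id": case["id"], "output": None, "error": repr(exc)})
    return results
```

Because the loop never short-circuits, a crash on test 7 still leaves scores and captured outputs for tests 1 through 6 and 8 onward.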

Stage 4: Scoring

For each test case, the validator applies the rubric to produce a score.

The grading model (DeepSeek-V3 via Letta Evals) evaluates the agent's response against:

  • The ground truth (expected output)
  • Scoring categories with weights (Task Completion, Relevance, etc.)
  • Penalties for specific failure modes
  • Acceptable variations that shouldn't reduce the score

The output is:

  • A numerical score between 0 and 1 (exactly 5 decimal places)
  • A short rationale explaining the score

This happens for every test case. A Challenge with 30 tests produces 30 individual scores per validator.
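The weighted-category-plus-penalties structure can be sketched numerically. The category names, the weighted-average combination rule, and the additive penalties are assumptions; only the 0-to-1 range and the 5-decimal-place output come from the pipeline as described.

```python
def score_response(category_scores, weights, penalties=()):
    """Combine weighted category scores, subtract penalties, clamp to
    [0, 1], and round to exactly 5 decimal places (combination rule assumed)."""
    total_w = sum(weights.values())
    base = sum(category_scores[c] * w for c, w in weights.items()) / total_w
    score = max(0.0, min(1.0, base - sum(penalties)))
    return round(score, 5)

# Hypothetical test case: strong completion, decent relevance, one penalty.
s = score_response(
    {"task_completion": 0.9, "relevance": 0.8},
    {"task_completion": 0.7, "relevance": 0.3},
    penalties=[0.05],
)
```

Note that "acceptable variations" enter the rubric as instructions to the grading model rather than arithmetic here: they tell the grader which category scores not to dock in the first place.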

Figma frame export of test results within the 30-test dataset from our Preview Challenge

Stage 5: Aggregation

Once all validators complete their evaluations, scores are aggregated.

The aggregation process:

1. Collects scores from all validators
2. Computes the final score for each test case
3. Calculates the overall Challenge score
4. Ranks submissions on the leaderboard
5. Distributes rewards based on performance
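A minimal sketch of steps 1 through 4 above, assuming a simple mean as the aggregation rule (the actual rule is not specified in this post):

```python
from statistics import mean

def aggregate(validator_scores):
    """validator_scores: {validator_id: [per-test scores]}.
    Average each test case across validators, then average the
    per-test results into one overall Challenge score (rule assumed)."""
    per_test = [mean(col) for col in zip(*validator_scores.values())]
    return round(mean(per_test), 5)

def rank(overall_scores):
    """overall_scores: {submission_id: score} -> leaderboard order."""
    return sorted(overall_scores, key=overall_scores.get, reverse=True)
```

Averaging across validators first preserves the per-test breakdown shown on the leaderboard; the overall score is derived from it, not computed separately.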

The leaderboard shows:

  • Overall position
  • Aggregate score
  • Breakdown by validator
  • Access to detailed results

You can drill into any submission to see exactly how it was scored — which tests it passed, which it failed, and why.

Why Transparency Matters

Most agent benchmarks are opaque. You get a number, maybe a ranking, but no insight into what actually happened.

SN121 is built differently:

Observable — Every stage of the pipeline produces visible outputs. You can see your agent's responses, the scores it received, and the rationale behind each score.

Reproducible — The same agent, same test suite, same rubric should produce the same results. Validators run independently to verify this.

Auditable — If a score seems wrong, you can trace it back. What did the agent output? What did the rubric specify? What rationale did the grader provide?

This transparency serves two purposes:

For developers — You get actionable feedback, not just a number. You can see exactly where your agent failed and why.

For the network — Trust requires transparency. Validators, miners, and the broader community can verify that evaluation is fair and consistent.

What You See on the Leaderboard

Shortly after our first Show & Tell, we deployed the evaluation pipeline to staging and published a walkthrough. In this video, Taylor (our Head of Product) walks through the leaderboard, Challenge details, submission results, and validator outputs.

When you visit the Challenge page on sundaebar.ai/lab, you can explore:

The Leaderboard — Top-performing agents ranked by score. Every result comes from running the agent against the same structured set of test inputs.

Challenge Details — The agent requirements, evaluation criteria, and submission instructions. You can see exactly what's being tested and how agents are graded.

Submission Breakdown — Drill into any submission to see detailed scores, validator outputs, and the rationale behind each result. This is where evaluation becomes fully transparent.

The Architecture

The pipeline is designed to scale and to incorporate improvements over time:

  • Stronger datasets — As we learn what tests reveal the most about agent capability, we can add new test cases
  • Refined rubrics — Scoring criteria can be tuned based on what we observe in production
  • More validators — Additional validators increase confidence in results
  • New Challenges — The same pipeline supports different Challenges with different focus areas

The Preview Challenge is the first implementation. The architecture is built to evolve.

Follow us on X for updates.