The SN121 Preview Challenge: What We Learned
By Taylor Sudermann
Before opening SN121 to live submissions, we ran a Preview Challenge. The goal was simple: test everything end-to-end with real agents, real evaluation, and real scoring — then learn from what we observed.
This post shares what worked, what didn't, and what we're changing before live submissions open.
What We Set Out to Test
The Preview Challenge wasn't about crowning a winner. It was about validating the system:
- Does the evaluation pipeline work end-to-end? From submission to scoring to aggregation.
- Does the rubric produce meaningful scores? Can we distinguish between good and bad agent performance?
- Are the test cases effective? Do they reveal real differences in agent capability?
- Is the output useful for developers? Can someone look at results and understand what to improve?
We published the full rubric and dataset, built agents internally using Letta with models served via Chutes, and ran them through the full evaluation suite. The Preview Challenge was open to explore — anyone could review the leaderboard, test cases, and detailed results — but all submitted agents were ours.
What Worked
The Pipeline Held Up
The end-to-end flow — submission, tasking, validation, scoring, aggregation — worked as designed. Agents were submitted, validators processed them independently, scores were computed, and results appeared on the leaderboard.
No black boxes. Every step produced observable output.
The Rubric Differentiated Performance
Agents with different capabilities produced meaningfully different scores. The weighted categories (Task Completion, Relevance, Schema Adherence, etc.) successfully captured distinct failure modes.
When an agent failed, we could see why it failed — not just that it failed.
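The idea of weighted categories surfacing distinct failure modes can be sketched in a few lines. This is a minimal illustration, not the actual SN121 rubric: the weights and the exact category set here are hypothetical, and the real rubric (published with the dataset) has more categories.

```python
# Hypothetical weights for illustration only -- the real SN121 rubric
# weights are defined in the published rubric, not here.
WEIGHTS = {
    "task_completion": 0.40,
    "relevance": 0.35,
    "schema_adherence": 0.25,
}

def aggregate(category_scores: dict[str, float]) -> float:
    """Combine per-category scores (each 0.0-1.0) into one weighted total."""
    return sum(WEIGHTS[cat] * category_scores.get(cat, 0.0) for cat in WEIGHTS)

# Two agents that both "mostly work" but fail differently get
# meaningfully different totals -- and the breakdown shows why.
strong = aggregate({"task_completion": 0.9, "relevance": 0.9, "schema_adherence": 1.0})
weak = aggregate({"task_completion": 0.9, "relevance": 0.4, "schema_adherence": 0.5})
```

A single accuracy number would hide that the second agent's problem is relevance and schema discipline, not task completion; the per-category breakdown makes it visible.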
Penalties Caught Real Issues
The penalty system flagged exactly what it was designed to catch:
- Hallucinated information not present in the source context
- Schema violations and malformed JSON
- Constraint violations (wrong number of items, exceeded word limits)
- Format errors (incorrect date formats, missing required fields)
These aren't abstract test failures. They're the exact issues that break agents in production.
The Rationale Was Useful
Having the grading LLM produce a short rationale alongside each score turned out to be one of the most valuable features. It made evaluation auditable. When a score seemed surprising, we could read the rationale and understand the reasoning.
What Didn't Work
Not Enough Hard Tests
This was our biggest learning. Looking at the Preview Challenge dataset:
- Easy: 14 tests (47%)
- Medium: 15 tests (50%)
- Hard: 1 test (3%)
One hard test. That was the problem.
Easy and medium tests establish baseline competence — they answer "can this agent do the job?" But they don't answer "which agent does it best?" When 29 of 30 tests (97%) are solvable by any competent agent, scores cluster at the top.
We needed tests that differentiate between good and great agents.
Edge Cases Need More Coverage
The Preview Challenge included normal, complex, and edge scenarios — but we found gaps. Certain failure modes we expected to see in production weren't adequately represented:
- Hallucination traps — Plausible-sounding but unstated facts that agents might invent
- Cascading calculations — Multi-step math where one early error ruins everything
- Context overriding numbers — Situations where the quantitative winner isn't the right answer
- Partial decisions — Tasks requiring nuanced judgment, not binary yes/no
Ambiguity Handling Varied Widely
Some tests involved ambiguous inputs where agents needed to ask clarifying questions or make reasonable assumptions. Agent behavior here was inconsistent, and our rubric criteria for "handling ambiguity well" needed refinement.
What We're Changing
Based on what we learned, we're expanding the dataset for our next Challenge.
More Hard Tests
We're shifting the difficulty distribution significantly:
Preview Challenge → Easy: 47% / Medium: 50% / Hard: 3%
Next Challenge → Easy: ~28% / Medium: ~32% / Hard: ~40%
This should create real score differentiation at the top of the leaderboard.
Expanding Capability Cluster Coverage
The new tests target specific failure modes in each cluster:
Quantitative Accuracy
- Multi-step calculations where one error cascades
- "Trap" numbers that look wrong but are correct
- Statistical formulas that are easy to misapply
Decision & Judgment
- Context that overrides quantitative rankings
- Partial approvals instead of binary decisions
- Intellectual honesty when information is incomplete
Information Handling
- Ambiguous input parsing
- Contradiction detection
- Temporal reasoning across documents
Execution Discipline
- Conflicting constraints that require trade-offs
- Semantic precision in structured outputs
- Edge-case extraction under tight constraints
Hallucination Traps
We're adding tests specifically designed to catch agents that invent plausible-sounding details:
- Expenses that look like violations but fall under the threshold
- Metrics provided as distractors that shouldn't be used
- Undefined policy details that agents must not fabricate
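To make the trap idea concrete, here is a hypothetical test case in the spirit of the first bullet: the context contains an expense that *looks* like a violation but falls under the stated threshold. Every name and value below is illustrative, not an actual dataset entry.

```python
# Hypothetical hallucination-trap test case (illustrative values only).
trap_case = {
    "id": "trap-expense-threshold",
    "difficulty": "hard",
    "context": (
        "Policy: expenses above $500 require pre-approval. "
        "Logged expense: team dinner, $480, no pre-approval on file."
    ),
    "question": "List any policy violations in the logged expenses.",
    # $480 is under the $500 threshold, so the correct answer is: none.
    "expected": {"violations": []},
}

def grade(agent_answer: dict, case: dict) -> bool:
    """Pass only if the agent did not invent a violation."""
    return agent_answer.get("violations") == case["expected"]["violations"]
```

An agent that pattern-matches "expense + no pre-approval" without checking the threshold invents a violation and fails; an agent that reads the numbers passes.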
Key Takeaways
1. Difficulty distribution matters more than test count. Adding more easy tests doesn't create differentiation. You need hard tests that separate good from great.
2. Transparency works. Being able to trace every score back to a rationale made debugging and improvement possible. This is the foundation we wanted.
3. Weighted scoring reveals failure modes. A single accuracy number hides too much. Breaking evaluation into categories with weights surfaces where agents actually struggle.
4. Penalties matter. Production failures aren't about "getting the answer mostly right." They're about schema violations, hallucinations, and constraint errors. Penalizing these explicitly changes what agents optimize for.
5. Test design is iterative. Writing tests that differentiate agent capability without being trivially easy or impossibly hard requires iteration. The Preview Challenge taught us a lot about what makes a good test.
What's Next
Live submissions are coming soon. The next Challenge will incorporate everything we learned from this preview — refined tests, expanded difficulty, and deeper capability cluster coverage.
If you're building agents, now is a good time to:
- Explore the Preview Challenge — See how evaluation works, review the test cases, understand the rubric
- Start building with Letta — Our Preview Challenge agents were all built on Letta's framework
- Follow our updates — We'll announce when live submissions open
Explore the Preview Challenge →