
Stop Shipping Skills on Vibes

Daniel Petro

You wrote a skill on a Tuesday afternoon. You tested it with three or four queries. It worked. You shipped it.

Six weeks later, the model gets updated. Your skill stops triggering on half the queries it used to catch. No one notices. Users file bugs against the agent, not the skill. You find out months later — if at all.

This is how most agent skills are built. Not because skill authors are careless, but because the entire ecosystem ships skills the same way the rest of AI ships features: on vibes.

I'm calling it that because Jessica Wang from Braintrust just gave a keynote with that exact title — Stop Shipping on Vibes — and watching it, I couldn't stop translating every word from "coding agents" to "agent skills." Her argument is devastating, and it applies to skills more forcefully than to the agents she was talking about.

Here's the argument, and here's what it means for anyone writing skills.

The Vibes Problem, in One Sentence

Jess put it like this: most teams ship AI features because an engineer says it looks ready, or a PM tries a handful of prompts and nods. That's the entire shipping decision. No dataset. No scorer. No experiment. Just vibes.

The fix is to make evals a first-class part of the development loop — datasets, tasks, scorers, experiments, production logging — so you can say specific, falsifiable things like "we ran 200 test cases, 94% passed" or "this change improved tone but dropped accuracy 5%."

Those sentences are the difference between engineering and hoping.

If you've shipped anything on top of an LLM, you've probably done the vibes thing. I have. The uncomfortable part is realizing how much of the agent ecosystem runs on this — and skills are the layer where it's worst.

Skills Are the New Prompt. And They Ship Naked.

A skill is a markdown file with a description, some instructions, maybe a few scripts. It sits in .claude/skills/ or gets distributed through a package manager. When the agent sees a query that looks like it matches, the skill loads into context and the agent follows its instructions.

That's the happy path. The unhappy paths cascade:

  • The skill stops triggering after a model update. Silent regression.
  • The skill triggers on queries it shouldn't, derailing unrelated tasks. Users blame the agent.
  • The skill works fine in English, silently fails in Japanese. The team never tests in Japanese, so the bug lives forever.
  • Two skills could plausibly match a query and the wrong one wins. No one knows because no one's measuring.

Every failure mode Jess described for coding agents — drift, regression, non-determinism, language blind spots — applies to skills. The difference is that skills sit lower in the stack than the agent features people are already under-testing. A skill failure cascades into every agent that loads it.

And here's the part that should keep skill authors up at night: most skills ship with zero tests. Not bad tests. Not flaky tests. Zero. The README is the spec. The author's memory is the regression suite. "It worked when I wrote it" is the acceptance criterion.

This is shipping on vibes, one abstraction layer deeper than what Jess was warning about.

Jess's Four-Part Loop, Translated to Skills

The clearest moment in Jess's talk was her framing of what an eval actually is. Four parts: dataset, task, scorer, experiment. Run one configuration and you have one experiment. Change anything, rerun, compare. That's the whole game.

Here's what each part looks like when the thing you're evaluating is a skill, not an agent.

Dataset: the queries you should have tested but didn't

Jess's rule for datasets: golden use cases, edge cases, failure modes. For skills, the dataset is a collection of queries — the things a user might say that should (or shouldn't) trigger the skill.

For a commit skill, your dataset might include:

  • "commit my changes" — must trigger
  • "save this" — should probably trigger
  • "make a commit with a good message" — must trigger
  • "what's in my last commit?" — must NOT trigger (read, not write)
  • "git status" — must NOT trigger

Edge cases are where skills die. Queries in languages the author didn't test. Queries with typos. Queries where two skills could plausibly match and only one should win. None of that surfaces in three Tuesday-afternoon test queries.
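
To make that concrete, here is a minimal sketch of the commit-skill dataset written as Python data. The TriggerCase name and its fields are my own illustration, not a selftune or Braintrust format, and the last two rows (a typo and a Japanese query) are hypothetical edge cases added only to show where rows like that would live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerCase:
    """One dataset row: a user query plus whether the skill should load on it."""
    query: str
    should_trigger: bool
    note: str = ""


# The first five rows come from the list above; the last two are hypothetical.
COMMIT_SKILL_CASES = [
    TriggerCase("commit my changes", True, "golden"),
    TriggerCase("save this", True, "ambiguous; 'should probably trigger'"),
    TriggerCase("make a commit with a good message", True, "golden"),
    TriggerCase("what's in my last commit?", False, "read, not write"),
    TriggerCase("git status", False, "inspection, not a commit"),
    TriggerCase("comit my changs", True, "typo (hypothetical edge case)"),
    TriggerCase("変更をコミットして", True, "Japanese (hypothetical edge case)"),
]


def split_by_expectation(cases):
    """Separate rows that must trigger from rows that must not."""
    positives = [c for c in cases if c.should_trigger]
    negatives = [c for c in cases if not c.should_trigger]
    return positives, negatives
```

Keeping the negatives in the same file as the positives matters: a dataset with only "must trigger" rows cannot catch a skill that fires on everything.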

The best source of dataset rows is the same place Jess recommends for agents: real production logs. Sample 10-20% of actual agent sessions, find the moments where your skill was loaded (or should have been), and turn those into test cases. This is the exact feedback loop selftune is built around.
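
A rough sketch of that sampling step, assuming each logged session is a dict with session_id, query, and loaded_skills keys. That record shape is an assumption for illustration, not selftune's actual log format. Hashing the session id keeps the sample stable across reruns, so the same sessions land in the dataset every time.

```python
import hashlib


def in_sample(session_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of sessions, keyed on session id."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def to_test_cases(sessions, skill_name, rate=0.15):
    """Turn sampled sessions into unlabeled dataset rows for human review."""
    cases = []
    for session in sessions:
        if not in_sample(session["session_id"], rate):
            continue
        cases.append({
            "query": session["query"],
            "skill_loaded": skill_name in session["loaded_skills"],
            "should_trigger": None,  # left blank on purpose: a human labels this
        })
    return cases
```

The should_trigger field stays empty because the log only tells you what happened, not what should have happened. Filling it in is the labeling work.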

Task: the thing most skill evals get wrong

For an agent, the task is a prompt and a model. For a skill, the task is two questions stacked:

  1. Did this query load the skill?
  2. Did following the skill's instructions produce the right behavior?

Skill evals have to test both. A perfect skill that never gets loaded is a broken skill. A correctly-loaded skill whose instructions are vague is also a broken skill. You need trigger accuracy AND execution quality. Both have to pass.

This is the part most skill authors don't realize they're skipping. They test the instructions in isolation by reading them and nodding. They never test whether the routing layer — the agent's decision about which skill to load — actually matches their intent.
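
Here is one way the two stacked questions could look as a single task function. Everything named here is hypothetical: run_agent stands in for whatever harness actually runs your agent and reports which skills it loaded, and check_execution stands in for your execution scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AgentRun:
    """What a (hypothetical) harness reports back for one query."""
    loaded_skills: List[str]
    transcript: str


@dataclass
class TaskResult:
    query: str
    trigger_ok: bool
    execution_ok: Optional[bool]  # None when execution was not applicable


def run_skill_task(
    query: str,
    should_trigger: bool,
    skill: str,
    run_agent: Callable[[str], AgentRun],
    check_execution: Callable[[AgentRun], bool],
) -> TaskResult:
    run = run_agent(query)
    triggered = skill in run.loaded_skills
    # Question 1: did routing match the author's intent?
    trigger_ok = triggered == should_trigger
    # Question 2 only applies when the skill loaded on a query it was meant for.
    execution_ok = check_execution(run) if (triggered and should_trigger) else None
    return TaskResult(query, trigger_ok, execution_ok)


def passed(result: TaskResult) -> bool:
    """Both stages have to pass; a skipped execution check does not count against."""
    return result.trigger_ok and result.execution_ok is not False
```

Keeping the two results separate is the point. A case that fails on routing and a case that fails on instructions need different fixes: the first is a description problem, the second is an instructions problem.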

Scorer: where you have to commit to a definition of "good"

Jess was emphatic about this: the scorer is where you, the human, define what good means. It's the hardest creative work in evals, and no framework will do it for you.

For skills, your scorer needs to cover four things:

  • Trigger precision: when the skill loaded, was it on a query it should have loaded for? (Deterministic.)
  • Trigger recall: did queries that should have matched actually match? (Deterministic.)
  • Execution quality: did following the instructions produce the intended outcome? (LLM-as-judge against a rubric you write in advance.)
  • Collateral damage: did loading this skill break unrelated behavior? (This is the one everyone forgets.)
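
The two deterministic scorers are ordinary precision and recall over (expected, observed) pairs. A minimal sketch, with the function name and input shape chosen by me for illustration:

```python
def trigger_metrics(results):
    """results: iterable of (should_trigger, did_trigger) boolean pairs."""
    tp = fp = fn = 0
    for should, did in results:
        if did and should:
            tp += 1  # loaded, and was supposed to
        elif did and not should:
            fp += 1  # loaded on a query it should have ignored
        elif should and not did:
            fn += 1  # stayed silent on a query it should have caught
    # None means "undefined": there was nothing to measure against.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"precision": precision, "recall": recall, "tp": tp, "fp": fp, "fn": fn}
```

Low precision points at a description that is too greedy; low recall points at one that is too narrow. Reporting them separately tells you which direction to edit.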

The execution-quality scorer is where most skill authors bail. Writing the rubric feels harder than writing the skill did. But Jess's point holds: if you can't articulate what good means, you can't ship with confidence. You can only ship with hope.
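
For a sense of scale, a rubric can be a handful of yes/no criteria written before any run. The sketch below is illustrative only: the three criteria are hypothetical ones for a commit skill, and judge is any callable you supply that sends a prompt to a model and returns its text reply. No real model API is shown or assumed.

```python
import json

# Hypothetical rubric for a commit skill, written before looking at any results.
RUBRIC = [
    "Stages only the files the user asked about",
    "Commit message summarizes the change in the imperative mood",
    "Does not push or amend unless asked",
]


def build_judge_prompt(transcript, rubric):
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, 1))
    return (
        "Grade the transcript against each criterion. "
        'Reply with JSON only: {"verdicts": [true, false, ...]} in criterion order.\n\n'
        f"Criteria:\n{criteria}\n\nTranscript:\n{transcript}"
    )


def score_execution(transcript, rubric, judge):
    """Ask the judge for one boolean per criterion; pass only if all are true."""
    raw = judge(build_judge_prompt(transcript, rubric))
    verdicts = json.loads(raw)["verdicts"]
    if len(verdicts) != len(rubric) or not all(isinstance(v, bool) for v in verdicts):
        raise ValueError("judge reply does not match the rubric")
    return {"per_criterion": dict(zip(rubric, verdicts)), "passed": all(verdicts)}
```

Per-criterion booleans are a deliberate choice over a single 1-to-10 score: when a case fails, you can see which sentence of the rubric it failed on.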

Experiment: the loop you've never actually run

An experiment is one run of one configuration. Tweak the description, rerun, compare to the previous run. This is where Braintrust's whole product lives, and it's also where selftune's does. Both projects exist because humans are bad at reading long traces and noticing that yesterday's edit quietly dropped trigger accuracy by 8%.

The discipline Jess called out — and she's right — is running multiple trials per test case. LLMs are non-deterministic enough that the same eval, run twice, can swing 10-15%. If your confidence comes from a single run, you don't have confidence. You have a number.

Most skill authors have never run their skill twice on the same query. They've certainly never averaged the results.
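
Running a query more than once and averaging takes very little code. In this sketch, run_once is a hypothetical callable that runs one trial of one query and returns True on a pass; the default of five trials is an arbitrary choice of mine, not a number from the talk.

```python
def run_trials(run_once, query, trials=5):
    """Run the same query several times and report how often it passed."""
    passes = sum(1 for _ in range(trials) if run_once(query))
    return {
        "query": query,
        "passes": passes,
        "trials": trials,
        "pass_rate": passes / trials,
        "flaky": 0 < passes < trials,  # passed sometimes, failed sometimes
    }


def summarize(per_query):
    """Average the per-query pass rates and list the queries that flip-flopped."""
    overall = sum(r["pass_rate"] for r in per_query) / len(per_query)
    flaky = [r["query"] for r in per_query if r["flaky"]]
    return {"overall_pass_rate": overall, "flaky_queries": flaky}
```

The flaky list is often more useful than the average. A query that passes three times out of five is telling you the description sits right on a routing boundary.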

What Jess's Talk Gave Me That I Can't Unsee

Vector proximity isn't understanding

The experiment Jess walked through compared agentic search against vector search on real code bugs. Agentic search won — not overwhelmingly, but clearly — and the reason applies directly to how skills are designed.

Vector search gave the agent proximity to relevant code but not connective tissue. Chunks came back without their imports, without the functions that called them, without the types they depended on. The agent got close, then had to guess.

Skills fail the same way when their descriptions are written for embedding similarity instead of human intent. A description stuffed with keywords matches a lot of queries semantically but misses the ones where the user phrased things the author didn't anticipate. The fix isn't more keywords. It's testing against real queries — Jess's dataset discipline — and rewriting descriptions when they miss.

Solo skill authors are three conflicts of interest in a trench coat

Jess made a slide about how evals are a team sport: AI engineers bringing data, PMs defining success criteria, subject matter experts labeling edge cases, data analysts interpreting results. One person can't do all four roles well.

Skills have the same problem in miniature, and worse. The person who writes a skill is usually also the person who tests it, grades it, and decides when it's good enough. That's three conflicts of interest stacked on top of each other. You will grade your own work generously. You will not notice your own blind spots. You will ship.

The practical fix for solo skill authors: outsource the scoring to an LLM-as-judge against a rubric you wrote in advance. Not perfect, but it breaks the conflict of interest and forces you to commit to a definition of good before you see the results. Write the rubric first. Run the eval. Read the failures. Iterate.

Failure visibility is the highest-leverage thing you can build

The part of Jess's talk I replayed twice was the bit about traces. She spent days getting Claude Code's subprocess traces to attach to the parent span, just so she could see what the agent was actually doing on each run. Without that visibility, she said, she couldn't debug why specific test cases failed. With it, the whole eval became tractable.

This is the single biggest gap in how most skill authors work today. They see the aggregate — "it usually works" — but not the individual failures. They can't click into the one query that regressed and see exactly which part of the skill's instructions the agent ignored. Without that, every edit is a guess.

Anything that closes this gap — trace capture, per-query transcripts, regression diffs between skill versions — buys more leverage per hour of work than almost anything else you could build.
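
The cheapest of those three to build is the regression diff. A minimal sketch, assuming each run has already been reduced to a mapping from query to pass rate; that shape is my assumption, not a selftune or Braintrust export format.

```python
def regression_diff(baseline, candidate):
    """Compare two runs, each a dict mapping query -> pass rate (0.0 to 1.0)."""
    shared = baseline.keys() & candidate.keys()
    # Worst drop first, so the query to click into is at the top.
    regressed = sorted(
        (q for q in shared if candidate[q] < baseline[q]),
        key=lambda q: candidate[q] - baseline[q],
    )
    improved = sorted(q for q in shared if candidate[q] > baseline[q])
    return {
        "regressed": regressed,
        "improved": improved,
        "only_in_baseline": sorted(baseline.keys() - candidate.keys()),
        "only_in_candidate": sorted(candidate.keys() - baseline.keys()),
    }
```

This turns "it usually works" into a short list of named queries, each of which you can then open up transcript by transcript.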

The Part That Should Make You Uncomfortable

Jess ended her own talk by saying she wouldn't publish a blog post about her eval results yet. Her gut told her the data was too clean. Agentic search winning 100% of the head-to-head comparisons was exactly the kind of conclusion that should trigger skepticism, not celebration.

Sit with that for a second. She ran the experiment, got results that supported the thesis she walked in with, and still refused to publish. That instinct — refuse to ship the claim until the eval is actually sound — is the thing I most want skill authors to internalize. Not the framework. Not the tooling. The instinct.

If you wrote a skill and your only evidence it works is that it worked the first time you tried it, you don't know if it works. You have a hypothesis and a hope.

Build the dataset. Write the scorer. Run the experiment. Run it again. Then ship.

Stop shipping skills on vibes.


Thanks to Jessica Wang and the Braintrust team for the talk that prompted this post. Braintrust evaluates agents; selftune evaluates the skills that shape them. Same loop, different layer of the stack. The industry needs both.