OpenAI Just Proved That Skill Descriptions Matter More Than Skill Logic

Earlier this month, Kazuhiro Sera from OpenAI published a blog post about how the Agents SDK team uses skills to maintain their Python and TypeScript SDKs. The numbers are striking: between December 2025 and February 2026, their two SDK repos merged 457 PRs — up 44% from 316 in the prior quarter — with the same team size.

But buried in the practical details of their setup is something more important than the throughput numbers. It's a finding that validates the entire premise selftune was built on.

The finding

"Writing better skill descriptions improved routing accuracy more than any change to the underlying skill logic itself."

Read that again. OpenAI's team, which maintains some of the most actively developed open source repos in the AI ecosystem, found through hands-on experience that how a skill is described did more for routing accuracy than any change to what the skill actually does.

This isn't a theoretical claim. It's an operational insight from a team processing hundreds of PRs per quarter.

Why descriptions are the bottleneck

The OpenAI blog describes their skill architecture clearly: each skill has metadata (a description and trigger conditions), a full SKILL.md with detailed instructions, and scripts that do the deterministic work. Their agent, Codex, reads the description first. If the description doesn't match the current context, the full instructions never get loaded.

They call this "progressive disclosure" — the description is a routing signal, not documentation. It determines whether the skill gets activated at all.

This means a perfectly implemented skill with a vague description can be worse than a mediocre skill with a precise description. The first rarely fires, so its logic never gets a chance to run. The second at least gets tried.
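The two-stage lookup described above can be sketched in a few lines of Python. This is an illustration only, not OpenAI's or selftune's implementation: the `Skill` class, the `route` function, the `release-notes` skill, and the keyword-overlap matcher are all hypothetical, and the matcher is a crude stand-in for the model's own judgment about whether a description fits the context.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    name: str
    description: str                      # short routing signal, always visible
    load_instructions: Callable[[], str]  # full SKILL.md body, loaded lazily


def tokens(text: str) -> set[str]:
    """Lowercased words longer than four letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 4}


def keyword_match(description: str, context: str) -> bool:
    # Hypothetical stand-in for the agent's routing decision.
    return bool(tokens(description) & tokens(context))


def route(context: str, skills: list[Skill]) -> Optional[str]:
    """Return the full instructions of the first skill whose description matches."""
    for skill in skills:
        # Stage 1: only the description is consulted.
        if keyword_match(skill.description, context):
            # Stage 2: the full instructions are loaded only after a match.
            return skill.load_instructions()
    return None


loaded: list[str] = []  # records which skills had their full body loaded


def make_loader(name: str) -> Callable[[], str]:
    def loader() -> str:
        loaded.append(name)
        return f"(full SKILL.md for {name})"
    return loader


skills = [
    Skill("release-notes", "Draft release notes when a version is tagged.",
          make_loader("release-notes")),
    Skill("verify",
          "Run the mandatory verification stack when changes affect "
          "runtime code, tests, or build/test behavior.",
          make_loader("verify")),
]

result = route("I changed runtime code in the scheduler", skills)
print(result)  # (full SKILL.md for verify)
print(loaded)  # ['verify']
```

The `loaded` list makes the point visible: the instructions for `release-notes` are never read, because its description never matched.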

Their original description for a verification skill was:

"Run the mandatory verification stack."

It didn't route well. The improved version:

"Run the mandatory verification stack when changes affect runtime code, tests, or build/test behavior."

Same skill. Same logic. Better description. Better routing.
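To see why the extra clause helps, here is a toy comparison of the two descriptions. The sample requests and their labels are invented for illustration, and the word-overlap rule is an assumption standing in for a real agent's routing, which is done by a model rather than by string matching.

```python
import re

ORIGINAL = "Run the mandatory verification stack."
IMPROVED = ("Run the mandatory verification stack when changes affect "
            "runtime code, tests, or build/test behavior.")

# Hypothetical user requests, each labelled with whether verification should run.
REQUESTS = [
    ("I changed the runtime scheduler", True),
    ("updated a few tests", True),
    ("tweaked build settings", True),
    ("fix a typo in the README", False),
]


def words(text: str) -> set[str]:
    """Lowercased words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def would_fire(description: str, request: str) -> bool:
    # Crude stand-in for the routing decision: any shared word.
    return bool(words(description) & words(request))


def correct_decisions(description: str) -> int:
    """Count requests where the toy router agrees with the label."""
    return sum(would_fire(description, req) == label for req, label in REQUESTS)


print(correct_decisions(ORIGINAL), "of", len(REQUESTS))  # 1 of 4
print(correct_decisions(IMPROVED), "of", len(REQUESTS))  # 4 of 4
```

The original description shares no vocabulary with any of the requests, so the toy router only gets the README case right, and only by staying silent. The improved one names the situations it applies to, which gives the router something to match against.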

What OpenAI does manually, selftune automates

The OpenAI team refined their descriptions by hand, through trial and error, over months of real usage. They noticed when skills didn't fire, investigated why, and rewrote descriptions until routing accuracy improved.

This is exactly the loop selftune automates:

Watch real agent sessions to detect when skills should have fired but didn't (false negatives) or fired when they shouldn't have (false positives)
Grade session quality to understand whether skill activations actually helped
Extract patterns from failures — what trigger words were users actually saying?
Evolve descriptions to match how users actually talk, not how developers think they'll talk
Validate proposed changes against an eval set before deploying
Monitor post-deploy to catch regressions
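The detection and validation steps in that loop can be sketched as follows. This is a minimal illustration under stated assumptions, not selftune's actual API: `EvalCase`, `RoutingReport`, `evaluate`, and `safe_to_deploy` are hypothetical names, the acceptance rule is one possible policy, and the two `*_fires` functions stand in for an agent routing against the current and the proposed description.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    should_trigger: bool


@dataclass
class RoutingReport:
    false_negatives: int  # skill should have fired but did not
    false_positives: int  # skill fired when it should not have


def evaluate(fires: Callable[[str], bool], cases: list[EvalCase]) -> RoutingReport:
    """Count routing errors of one description over the eval set."""
    fn = sum(1 for c in cases if c.should_trigger and not fires(c.query))
    fp = sum(1 for c in cases if not c.should_trigger and fires(c.query))
    return RoutingReport(fn, fp)


def safe_to_deploy(current: RoutingReport, candidate: RoutingReport) -> bool:
    """Accept only if no error type gets worse and at least one improves."""
    no_regression = (candidate.false_negatives <= current.false_negatives
                     and candidate.false_positives <= current.false_positives)
    improved = (candidate.false_negatives < current.false_negatives
                or candidate.false_positives < current.false_positives)
    return no_regression and improved


cases = [
    EvalCase("verify my changes", True),
    EvalCase("check my PR", True),
    EvalCase("write release notes", False),
]


def current_fires(query: str) -> bool:
    # Stand-in for routing with the current description.
    return "verify" in query.lower()


def candidate_fires(query: str) -> bool:
    # Stand-in for routing with a description that also covers "check my PR".
    q = query.lower()
    return "verify" in q or "check my pr" in q


current_report = evaluate(current_fires, cases)
candidate_report = evaluate(candidate_fires, cases)
print(current_report)    # RoutingReport(false_negatives=1, false_positives=0)
print(candidate_report)  # RoutingReport(false_negatives=0, false_positives=0)
print(safe_to_deploy(current_report, candidate_report))  # True
```

Requiring that neither error count gets worse is a deliberately conservative choice: a description that catches more requests but also starts firing where it should not is rejected rather than traded off.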

OpenAI discovered the problem through operational pain. selftune treats it as a continuous optimization loop.

The parallel architecture

What's remarkable is how closely OpenAI's skill architecture mirrors what selftune already expects. Here's a side-by-side:

Concept	OpenAI's Implementation	selftune's Equivalent
Skill descriptions as routing signals	Metadata in `.agents/skills/`	SKILL.md routing table + trigger keywords
Progressive disclosure	Description → full SKILL.md	SKILL.md → Workflows/*.md
Narrow contracts	One skill, one job	Grading evaluates trigger precision
Model-script boundary	Model judges, scripts execute	CLI handles deterministic work, agent handles judgment
If/then enforcement	AGENTS.md mandates skill usage	SKILL.md routing table with trigger conditions
Two-stage validation	Script execution + model comparison	hooks-to-evals + grading pipeline
Local → CI graduation	Prove locally, then Codex GitHub Action	watch → confidence threshold → deploy

These teams arrived at the same architecture independently. That's not coincidence — it's convergent evolution driven by the same underlying constraint: agents need structured metadata to make good routing decisions, and that metadata needs to evolve based on real usage.

The 44% number

OpenAI's throughput increase — 457 PRs versus 316, a 44% jump — is compelling. But it's a team-level metric that's hard to attribute to any single change.

selftune doesn't yet surface aggregate productivity metrics like this. It tracks per-skill health (pass rates, trigger accuracy, false negative counts), which is more granular but less compelling as a headline number. This is something we're thinking about — how do you measure the compound effect of better skill routing across an entire workflow?

The honest answer is that we don't know yet. But OpenAI's numbers are at least consistent with a real and significant effect.

What this means for skill authors

If you're building skills for AI agents — whether for Claude Code, Codex, or any other agent framework — the lesson from OpenAI's experience is clear:

Invest more time in your descriptions than your logic.

Specifically:

Include trigger context, not just capability. "Verify code changes" is worse than "Verify code changes when runtime code, tests, examples, or build behavior is modified."
Use the words your users actually use. If users say "check my PR" but your description says "execute verification pipeline," you have a routing gap.
Treat descriptions as living documents. They should evolve based on real usage data, not sit unchanged after initial authoring.
Measure routing accuracy. If you don't know how often your skill fires correctly versus incorrectly, you're flying blind.
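The first two checks are easy to approximate by hand. The sketch below is a rough heuristic, not a selftune feature: the cue-word list and both function names are assumptions made for illustration, and a real check would need to look at actual session data rather than a regular expression.

```python
import re

# Hypothetical heuristic: a description carries trigger context if it contains
# a conditional cue word.
TRIGGER_CUES = re.compile(r"\b(when|if|whenever|after|before)\b", re.IGNORECASE)


def has_trigger_context(description: str) -> bool:
    """True if the description says when to use the skill, not just what it does."""
    return TRIGGER_CUES.search(description) is not None


def missing_user_phrases(description: str, user_phrases: list[str]) -> list[str]:
    """Return the user phrases that share no word with the description."""
    desc_words = set(re.findall(r"[a-z]+", description.lower()))
    return [p for p in user_phrases
            if not desc_words & set(re.findall(r"[a-z]+", p.lower()))]


print(has_trigger_context("Verify code changes"))  # False
print(has_trigger_context(
    "Verify code changes when runtime code, tests, examples, "
    "or build behavior is modified."))  # True
print(missing_user_phrases("execute verification pipeline", ["check my PR"]))
# ['check my PR']
```

A description that fails either check is not necessarily wrong, but it is a good candidate for a second look.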

Or just install selftune and let it handle the evolution for you. That's what it's built for.

The bigger picture

We're still early in understanding how agents use skills effectively. OpenAI's blog is one of the first detailed accounts from a major AI company about the operational reality of skill-based agent workflows at scale.

The core insight — that metadata quality determines routing quality, which determines whether skills deliver value at all — should shape how the entire ecosystem thinks about skill development. Building a great skill isn't enough. You need a great description, and that description needs to match how people actually work.

This is a continuous optimization problem, not a one-time authoring task. And continuous optimization is exactly what selftune does.