Anthropic Uses Hundreds of Skills Internally. Here's What They Learned.

Anthropic just published their internal playbook for building and distributing skills at scale inside Claude Code.

They've got hundreds of skills in active use. They've cataloged them, figured out what works, and shared the patterns. It's one of the most useful things they've put out for skill developers.

We read it carefully. Here's what actually matters — and what it changes about how you should think about your skills.

The description field is not a summary

This one line from the post is worth more than the rest of it:

"The description field is not a summary — it's a description of when to trigger this skill."

At the start of every Claude Code session, the model scans your skill descriptions to decide which skills are relevant. If your description reads like a product tagline, the model can skip your skill, even when it should use it.

This is what skill developers call undertriggering. Anthropic uses that term too. Your skill exists, it works, but it never fires because the description doesn't match how users actually phrase their requests.

selftune detects undertriggering automatically. It watches real sessions, spots the moments where your skill should have been called but wasn't, and tells you exactly what to change in the description. Most skills improve their trigger rate within one revision cycle.
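
To make the tagline-versus-trigger difference concrete, here's a toy check you can run yourself. It's a rough sketch in Python that assumes plain keyword overlap is a good-enough proxy. It is not how Claude picks skills and not how selftune works, and the function names, stopword list, threshold, and sample strings are all made up for illustration.

```python
import re

# Hypothetical stopword list for this toy example; not a real tokenizer.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "and", "or", "in", "on",
             "my", "this", "that", "is", "it", "with", "when", "use"}

def keywords(text: str) -> set:
    """Lowercase the text and return its words, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def overlap(description: str, request: str) -> float:
    """Fraction of the request's keywords that also appear in the description."""
    req = keywords(request)
    if not req:
        return 0.0
    return len(req & keywords(description)) / len(req)

def likely_missed(description: str, requests: list, threshold: float = 0.3) -> list:
    """Return the sample requests whose overlap with the description is below threshold."""
    return [r for r in requests if overlap(description, r) < threshold]

# Two made-up descriptions for the same imaginary skill.
tagline = "Effortless, beautiful changelogs."
trigger = ("Use when the user asks to write release notes, update the "
           "changelog, or summarize commits for a release.")

# Made-up phrasings of what a user might actually type.
requests = ["write release notes for v2",
            "summarize commits since the last release"]

print(likely_missed(tagline, requests))  # both requests are flagged
print(likely_missed(trigger, requests))  # nothing is flagged
```

The tagline shares no words with either request, so both get flagged. The trigger-style description covers most of the words in each request, so neither does. Real triggering is semantic, not keyword matching, so treat this as a sanity check on your wording and nothing more.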

The Gotchas section is the highest-signal content in any skill

Another direct quote:

"The highest-signal content in any skill is the Gotchas section."

Not the overview. Not the examples. The Gotchas.

This makes sense once you think about it. Claude already knows general coding patterns. The only things worth putting in a skill are the things that push the model away from its defaults — the edge cases, the footguns, the wrong turns it takes when left to its own judgment.

A skill with no Gotchas section is a skill that hasn't been used yet. Real usage creates real failure patterns. Documenting those failures is how a skill matures from "technically correct" to "actually reliable."

selftune builds your Gotchas section from evidence. It analyzes execution sessions, finds the recurring failure patterns, and surfaces evolution proposals — including specific Gotcha entries — based on what the agent actually did wrong. You don't have to remember to update the skill. The system does it.
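
If you want a feel for the idea before automating it, here's a minimal sketch in Python that assumes you already have short failure notes written down from real sessions. It only counts repeats and drafts Gotcha lines from them. It is not selftune's analysis, and the function name, the min_count cutoff, and the sample notes are hypothetical.

```python
from collections import Counter

def recurring_failures(failure_notes: list, min_count: int = 2) -> list:
    """Draft Gotcha entries from notes that show up at least min_count times."""
    # Normalize case and whitespace so near-identical notes are counted together.
    counts = Counter(note.strip().lower() for note in failure_notes)
    return [f"- {note} (seen {n} times)"
            for note, n in counts.most_common()
            if n >= min_count]

# Made-up notes, one per failed session.
notes = [
    "ran npm publish without a dry run",
    "Ran npm publish without a dry run",
    "edited CHANGELOG.md by hand",
    "ran npm publish without a dry run ",
]

for line in recurring_failures(notes):
    print(line)
```

The one-off note is dropped and the repeated one becomes a candidate Gotcha entry. The cutoff is a judgment call: a failure that happens once may be noise, while one that keeps happening is the model's default showing through.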

Skills are folders, not markdown files

A common mistake is treating a skill as a single SKILL.md file. Anthropic is explicit that the best skills use the entire folder structure:

references/ for API docs and code snippets the model should read on demand
assets/ for templates it should copy and fill in
scripts/ for tools it should run rather than reimplement from scratch

They call this progressive disclosure — the model reads what it needs, when it needs it, instead of getting everything dumped into context at once.

If your skill is a single markdown file, you're leaving a lot of reliability on the table. An agent that can pull from a folder of well-organized resources outperforms one working from a wall of text.
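
Here's one way to lay that folder out, as a small Python sketch. The three subfolder names come from the list above. Everything else is an assumption for illustration: the skill name, the description, the references/api.md file it points to, and the exact contents of SKILL.md (we're assuming a frontmatter block with name and description fields).

```python
from pathlib import Path
import tempfile

# Subfolder names taken from the list above.
SUBFOLDERS = ("references", "assets", "scripts")

def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create <root>/<name>/ containing SKILL.md and the three subfolders."""
    skill_dir = root / name
    for sub in SUBFOLDERS:
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)

    # Keep SKILL.md short and point to the details, so the model
    # reads them only when it needs them.
    skill_md = (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        "## Gotchas\n\n"
        "(add observed failure patterns here)\n\n"
        "## References\n\n"
        "Read references/api.md only when you need exact signatures.\n"
    )
    (skill_dir / "SKILL.md").write_text(skill_md, encoding="utf-8")
    return skill_dir

# Build the skill in a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    skill = scaffold_skill(
        Path(tmp),
        "release-notes",  # hypothetical skill name
        "Use when the user asks to write release notes.",
    )
    created = sorted(p.name for p in skill.iterdir())
    print(created)
```

The point of the layout is that SKILL.md stays small enough to read every time, while the long material sits in files the model opens on demand.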

The curation problem is real — and unsolved

Here's the part of the post that stood out most:

"It can be quite easy to create bad or redundant skills, so making sure you have some method of curation before release is important."

Anthropic identifies this as the key challenge for internal skill marketplaces. How do you prevent bad skills from spreading? How do you find the ones that are actually high quality?

They don't offer a technical solution. They rely on social processes — Slack posts, sandbox folders, PR reviews. That works at Anthropic. It doesn't scale to a community.

selftune is the quality layer that makes skill marketplaces trustworthy. Every skill gets a quality score. The score is based on observed behavior, not documentation. A skill that looks polished but fails in production gets a low score. A simple skill that triggers reliably and executes correctly gets a high one.
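
To show what "based on observed behavior" can mean in practice, here's a deliberately simple sketch in Python. It is not selftune's scoring formula. We're assuming a score equal to trigger rate times success rate, and the SessionRecord fields, function name, and sample numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SessionRecord:
    should_have_triggered: bool  # the request was one the skill is meant for
    triggered: bool              # the skill actually fired
    succeeded: bool              # the run produced a correct result

def quality_score(records: list) -> float:
    """Toy score: (share of relevant sessions that fired) x (share of fired runs that succeeded)."""
    relevant = [r for r in records if r.should_have_triggered]
    fired = [r for r in relevant if r.triggered]
    if not relevant or not fired:
        return 0.0
    trigger_rate = len(fired) / len(relevant)
    success_rate = sum(r.succeeded for r in fired) / len(fired)
    return round(trigger_rate * success_rate, 2)

# A polished-looking skill: fires in 4 of 10 relevant sessions, succeeds in 2 of those.
polished = ([SessionRecord(True, True, True)] * 2
            + [SessionRecord(True, True, False)] * 2
            + [SessionRecord(True, False, False)] * 6)

# A simple skill: fires in 9 of 10 relevant sessions and succeeds every time it fires.
simple = ([SessionRecord(True, True, True)] * 9
          + [SessionRecord(True, False, False)])

print(quality_score(polished))  # 0.2
print(quality_score(simple))    # 0.9
```

Nothing in the calculation looks at the documentation. A skill only scores well if it fires when it should and does the job when it fires.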

When you install a skill from a marketplace, you want to know it works. That's what selftune tells you.

What this means for your skills right now

Three things you can do today based on Anthropic's lessons:

1. Rewrite your description as a trigger condition, not a summary. Ask yourself: "Would Claude recognize this as the right skill if a user typed [common request]?" If the answer is no, rewrite the description to match the language your users actually use.

2. Add a Gotchas section if you don't have one. Even if you haven't observed failures yet, think through the edge cases. What would the model get wrong without explicit guidance? What's the default behavior you're trying to override?

3. Move reference material into files. If your skill has a long list of API signatures, examples, or code patterns, pull them into a references/ subfolder and point Claude to read them when needed.

Or run selftune ingest on your skill and let the system tell you what needs improving. That's what it's for.

selftune is an open-source tool for improving agent skills. Install it with npm install -g selftune and run selftune ingest to get your first quality report.