
How to Write Good Agent Skills: 8 Best Practices

Daniel Petro

Most agent skills never fire. The author writes them, publishes them, and assumes they work. They don't. The skill sits in a directory, structurally valid, functionally inert. If you want to learn how to create agent skills that actually trigger when users need them, you need to understand why most skills fail — and what the 8 practices below do to fix it.

The core problem is silent failure. When a skill doesn't trigger, nothing breaks visibly. The agent just proceeds without it. No error. No log. No indication that a skill existed and should have been used. Vercel's engineering team measured a 0% invocation rate on a skill with bad description syntax — the skill was deployed for weeks before anyone noticed. Most skill authors never discover their skill doesn't fire because there is no feedback mechanism built into the authoring workflow.

These 8 practices address the full lifecycle: writing descriptions that match real user language, structuring SKILL.md files for reliable parsing, testing with diverse phrases, and monitoring trigger rates after deployment.

1. Write Descriptions That Match How Users Actually Talk

The description field in your SKILL.md is the trigger. It's the text the agent matches against when deciding whether to invoke your skill. If the description doesn't match the way users phrase their requests, the skill won't fire.

Most skill authors write descriptions that describe what the skill does from the author's perspective. Users don't talk that way.

Bad — author-centric, narrow vocabulary:

# database-migration

## Description
This skill manages database migrations and schema updates.

Good — covers the full invocation taxonomy:

# database-migration

## Description
Use when the user asks to update the database schema, run migrations,
create tables, modify columns, add indexes, or manage database changes.
Also use when the user says "the database needs a new field," "add a
column to the users table," or "I need to change the schema."

The difference is coverage. The good description handles three categories of user language: explicit ("run migrations"), implicit ("update the database schema"), and contextual ("the database needs a new field"). A skill that only matches explicit invocations will miss 40-60% of real user queries, based on invocation taxonomy data from selftune analysis across 200+ skills.

2. Include Negative Triggers (When NOT to Fire)

False positives erode trust faster than false negatives. When a skill fires incorrectly, users learn to distrust the system. Negative triggers tell the agent when to suppress the skill.

Without negative triggers:

## Description
Use when the user asks about database operations, queries, or schema changes.

This fires on SELECT queries, read-only operations, and reporting questions — none of which need a migration skill.

With negative triggers:

## Description
Use when the user asks to update the database schema, run migrations,
create or modify tables, add indexes, or manage structural database changes.

Do NOT use this skill for:
- Read-only database queries or SELECT statements
- Database connection or configuration issues
- Query performance tuning
- Backup and restore operations

Negative triggers reduce false positive rates by 30-50% in skills that have broad domain overlap with other skills.

3. Structure SKILL.md with Clear Sections

Agents parse SKILL.md files looking for specific information. A well-structured file makes every section easy to find and unambiguous to interpret.

# deploy-preview

## Description
Use when the user asks to deploy a preview environment, create a staging
build, test a branch deployment, or see changes live before merging.
Do NOT use for production deployments or CI/CD pipeline configuration.

## Trigger Phrases
- "deploy a preview"
- "I want to see this branch live"
- "create a staging environment"
- "can I test this before merging"

## Instructions
1. Check that the current branch has been pushed to the remote.
2. Run `vercel build` to verify the build succeeds locally and to produce the prebuilt output.
3. Execute `vercel deploy --prebuilt` to create the preview.
4. Return the preview URL to the user.
5. If the build fails, show the error output and suggest fixes.

## Examples

User: "I want to see what this looks like before I merge"
Action: Build and deploy preview, return URL

User: "Deploy this to staging"
Action: Build and deploy preview, return URL

Key structural elements: description with both positive and negative triggers, explicit trigger phrases for disambiguation, numbered instructions for deterministic execution, and examples that show the mapping from user intent to skill behavior.
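A structure like this can be checked mechanically before a skill ships. The following is a minimal Python sketch, not part of any skill tooling: `parse_sections`, `missing_sections`, and the `REQUIRED_SECTIONS` list are hypothetical names, and it assumes sections are marked with `## ` headings as in the example above.

```python
import re

# Assumption: these four headings mirror the example layout shown above.
REQUIRED_SECTIONS = ("Description", "Trigger Phrases", "Instructions", "Examples")


def parse_sections(skill_md: str) -> dict[str, str]:
    """Split SKILL.md text into {heading: body} using '## ' headings."""
    sections: dict[str, str] = {}
    current = None
    for line in skill_md.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}


def missing_sections(skill_md: str) -> list[str]:
    """Return required sections that are absent or have an empty body."""
    sections = parse_sections(skill_md)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]
```

Running `missing_sections` on a draft gives a quick list of what still needs to be written; an empty list means every expected section is present and non-empty.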

4. Add Concrete Examples in Your Instructions

Examples reduce ambiguity more than instructions alone. When an agent sees "deploy a preview," the instruction "run vercel deploy" is clear. But when a user says "can I see what this PR looks like," the agent needs examples to map that phrase to the same action.

Include 2-3 example interactions that show diverse user phrasings mapped to the correct skill behavior:

## Examples

User: "Deploy a preview of this branch"
→ Direct match. Run build, deploy preview, return URL.

User: "I want to check if the header looks right on mobile"
→ Contextual match. User wants visual verification, which requires
  a preview deployment. Run build, deploy preview, return URL.

User: "Push this to production"
→ DO NOT trigger. This is a production deployment, not a preview.

The third example is a negative example. Including what the skill should reject is as valuable as showing what it should accept. According to Anthropic's documentation on skill design, concrete examples improve agent reasoning about edge cases where the description alone is ambiguous.

5. Keep Skills Focused (Single Responsibility)

A skill that does one thing well triggers more reliably than a skill that does five things adequately. Broad skills create two problems: they trigger on queries they shouldn't handle (false positives), and their descriptions become too vague to match specific queries (false negatives).

Too broad:

# devops

## Description
Use for any DevOps-related task including deployments, monitoring,
CI/CD, infrastructure, containers, and cloud management.

This skill will conflict with any other skill in the deployment, monitoring, or infrastructure space. It's also too vague to trigger reliably on specific queries like "set up a health check endpoint."

Focused:

# health-check

## Description
Use when the user asks to add a health check endpoint, configure
liveness or readiness probes, or set up uptime monitoring for a
service. Do NOT use for general monitoring dashboards or alerting.

If a skill's description needs more than 5 sentences to cover its scope, it should be two skills.
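The five-sentence rule, together with the earlier advice to state when a skill should not fire, is simple enough to lint. This is a rough Python sketch under stated assumptions: `count_sentences` and `lint_description` are hypothetical helpers, the sentence splitter is naive (it splits after `.`, `!`, or `?` followed by whitespace), and the negative-trigger check only looks for the literal wording "Do NOT use".

```python
import re

MAX_SENTENCES = 5  # threshold taken from the rule of thumb above
NEGATIVE_CLAUSE = re.compile(r"\bdo\s+not\s+use\b", re.IGNORECASE)


def count_sentences(description: str) -> int:
    """Naive sentence count: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", description.strip())
    return len([part for part in parts if part])


def lint_description(description: str) -> list[str]:
    """Return warnings for an overlong or unbounded description."""
    warnings = []
    if count_sentences(description) > MAX_SENTENCES:
        warnings.append("more than 5 sentences: consider splitting the skill")
    if not NEGATIVE_CLAUSE.search(description):
        warnings.append("no 'Do NOT use' clause: scope may be too broad")
    return warnings
```

Applied to the two descriptions above, the focused health-check description passes cleanly, while the broad devops description is flagged for lacking a negative clause.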

6. Test with Real User Phrases, Not Just Your Own

You wrote the skill. You know how it works. Your test phrases will reflect your mental model, not your users'. This is the most common reason skills undertrigger in production.

A practical test: ask 5 people who haven't seen the skill to describe the task it handles. Write down their exact words. If your skill description doesn't cover at least 80% of their phrases, rewrite it.

Common gaps this reveals:

  • Jargon mismatch. You write "execute migration." Users say "update the database."
  • Abstraction level. You write "deploy preview." Users say "I want to see what this looks like."
  • Implicit intent. You write "run linter." Users say "is my code clean?"

Each of these is a missed trigger. Multiply by the number of users and the cost compounds.
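The 80% check can be approximated in a few lines once you have collected phrases. The sketch below is a crude proxy, not how an agent actually matches: it treats a phrase as covered when at least 60% of its content words appear in the description. `STOPWORDS`, `is_covered`, `coverage`, and the 0.6 overlap cutoff are all illustrative assumptions.

```python
import re

# Assumption: a small hand-picked stopword list, good enough for a sketch.
STOPWORDS = {"a", "an", "the", "i", "to", "my", "is", "this", "of", "in",
             "on", "for", "it", "can", "we", "me", "what", "want", "need"}


def content_words(text: str) -> set[str]:
    """Lowercased words with stopwords removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def is_covered(phrase: str, description: str, min_overlap: float = 0.6) -> bool:
    """True if enough of the phrase's content words appear in the description."""
    words = content_words(phrase)
    if not words:
        return False
    hits = words & content_words(description)
    return len(hits) / len(words) >= min_overlap


def coverage(phrases: list[str], description: str) -> float:
    """Fraction of collected phrases that the description covers."""
    if not phrases:
        return 0.0
    return sum(is_covered(p, description) for p in phrases) / len(phrases)
```

If `coverage(collected_phrases, description)` comes back below 0.8, the uncovered phrases are the vocabulary to fold into the next rewrite. Word overlap will miss paraphrases entirely, so treat a low score as a prompt to look closer and a high score as weak evidence only.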

7. Add Eval Sets for Regression Detection

Eval sets are structured test suites that verify your skill triggers correctly across a defined set of queries. Without them, every edit to your SKILL.md is a gamble — you might fix one trigger and break three others.

A basic eval set covers four query types (explicit, implicit, contextual, and negative):

# eval-set.yaml
skill: deploy-preview
queries:
  explicit:
    - "deploy a preview"
    - "create a preview deployment"
    - "run a preview build"
  implicit:
    - "I want to see this branch live"
    - "can I check what this looks like"
    - "show me a staging version"
  contextual:
    - "the client wants to review before we merge"
    - "let me verify the responsive layout"
  negative:
    - "deploy to production"
    - "set up the CI pipeline"
    - "check the deployment logs"

Tools like selftune can generate eval sets from real session data: queries that users already sent, in the language they already used. This closes the gap between synthetic tests and production behavior. Anthropic's skill-creator also supports writing evals at authoring time.

Run your eval set after every SKILL.md change. If trigger accuracy drops, the change is a regression regardless of whether it looks correct to you.
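As a rough illustration of that loop, the Python sketch below scores an eval set and compares the result against the previous run. It assumes the eval set has already been loaded into a dict keyed by category, as in the YAML above, and that you supply a `fires` callable reporting whether the skill triggered for a query. `fires`, `trigger_accuracy`, and `is_regression` are hypothetical names; how you observe a trigger depends on your agent.

```python
from typing import Callable


def trigger_accuracy(eval_set: dict[str, list[str]],
                     fires: Callable[[str], bool]) -> float:
    """Fraction of queries whose trigger decision matches the expectation.

    Queries under 'negative' are expected NOT to fire; all others should.
    """
    total = 0
    correct = 0
    for category, queries in eval_set.items():
        expected = category != "negative"
        for query in queries:
            total += 1
            correct += fires(query) == expected
    return correct / total if total else 0.0


def is_regression(before: float, after: float, tolerance: float = 0.0) -> bool:
    """True if accuracy dropped by more than the allowed tolerance."""
    return after < before - tolerance
```

Store the accuracy from the last accepted version of the SKILL.md, rerun after each edit, and reject the edit when `is_regression` returns True.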

8. Monitor Trigger Rates in Production

Testing catches failures you predict. Monitoring catches failures you don't. The distinction matters because most skill failures come from user language you never anticipated.

Three metrics to track:

  • Trigger rate. What percentage of relevant queries actually fire the skill? Below 70% means your description has gaps.
  • False positive rate. What percentage of triggers are incorrect? Above 10% means your description is too broad or missing negative triggers.
  • Execution quality. When the skill fires, does it produce the expected output? A skill that triggers correctly but executes poorly is worse than one that doesn't trigger at all.
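The first two metrics fall out of a simple log of labeled events. The sketch below is a minimal Python illustration, assuming each logged query has been labeled with whether the skill should have fired and whether it did. The `Event` record and the function names are hypothetical; the 70% and 10% thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    relevant: bool  # should the skill have fired for this query?
    fired: bool     # did the skill actually fire?


def trigger_rate(events: list[Event]) -> Optional[float]:
    """Share of relevant queries that fired the skill (None if no relevant queries)."""
    relevant = [e for e in events if e.relevant]
    if not relevant:
        return None
    return sum(e.fired for e in relevant) / len(relevant)


def false_positive_rate(events: list[Event]) -> Optional[float]:
    """Share of triggers that were incorrect (None if the skill never fired)."""
    fired = [e for e in events if e.fired]
    if not fired:
        return None
    return sum(not e.relevant for e in fired) / len(fired)


def health_flags(events: list[Event]) -> list[str]:
    """Apply the 70% trigger-rate and 10% false-positive thresholds."""
    flags = []
    rate = trigger_rate(events)
    fp_rate = false_positive_rate(events)
    if rate is not None and rate < 0.70:
        flags.append("trigger rate below 70%: description has gaps")
    if fp_rate is not None and fp_rate > 0.10:
        flags.append("false positive rate above 10%: description too broad")
    return flags
```

Execution quality is left out on purpose: judging output needs a task-specific check that a generic counter cannot provide.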

Manual monitoring is unsustainable. Automated monitoring is the difference between hoping your skill works and knowing it works.

npx selftune@latest doctor

This runs a diagnostic across your installed skills — checking description coverage, identifying potential trigger gaps, and flagging skills that may be undertriggering based on their description structure. It's the starting point for moving from guesswork to measurement.

How to Test Your Agent Skills

Three testing layers, ordered by effort and coverage:

Manual testing. Describe the task your skill handles in 10 different ways. Vary the vocabulary, abstraction level, and specificity. Check whether the skill fires for each. This takes 15 minutes and catches the obvious gaps.

Eval set testing. Build or generate an eval set (see Practice 7). Run it after every SKILL.md change. This catches regressions that manual testing misses because humans forget to re-test old queries after making new changes.

Baseline testing. Measure whether your skill performs better than no skill at all. Run the same queries with the skill loaded and without it. If the base model handles the task equally well, the skill is adding complexity without value — a signal that either the skill needs improvement or the base model has absorbed its capabilities.
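A paired comparison makes the baseline test concrete. This Python sketch assumes you have already run each query twice and recorded whether the outcome was acceptable with and without the skill; `compare_to_baseline` and the result keys are hypothetical, and deciding what counts as a pass is left to you.

```python
def compare_to_baseline(results: dict[str, tuple[bool, bool]]) -> dict:
    """Summarize paired runs.

    results maps query -> (passed_with_skill, passed_without_skill).
    """
    n = len(results)
    with_rate = sum(w for w, _ in results.values()) / n if n else 0.0
    without_rate = sum(wo for _, wo in results.values()) / n if n else 0.0
    return {
        "with_skill": with_rate,
        "without_skill": without_rate,
        "lift": with_rate - without_rate,
        # Queries where the skill changed the outcome, in either direction.
        "helped": sorted(q for q, (w, wo) in results.items() if w and not wo),
        "hurt": sorted(q for q, (w, wo) in results.items() if wo and not w),
    }
```

A lift near zero suggests the base model already handles the task; a non-empty `hurt` list points to queries where the skill makes things worse and deserves a closer look.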

FAQ

How long should a SKILL.md description be?

2-5 sentences. The first sentence covers the primary use case with explicit trigger language. Sentences 2-3 cover implicit and contextual triggers. The final sentence defines negative triggers (when NOT to fire). Shorter descriptions miss triggers. Longer descriptions introduce noise that reduces matching precision.

How many skills is too many?

There is no absolute limit, but the practical ceiling is where skills start conflicting — two skills triggering on the same query. Monitor for conflicts using tools like selftune's composability analysis. If you have 50 skills and 10 of them trigger on overlapping queries, you have a conflict problem, not a quantity problem.
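For a quick manual check, overlap can be found from a record of which queries fired which skills. The Python sketch below is an illustration only and is not selftune's composability analysis: `find_conflicts` and the input shape (skill name mapped to the set of queries it fired on) are assumptions.

```python
from itertools import combinations


def find_conflicts(fired_by_skill: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return each pair of skills together with the queries that fired both."""
    conflicts = {}
    for first, second in combinations(sorted(fired_by_skill), 2):
        shared = fired_by_skill[first] & fired_by_skill[second]
        if shared:
            conflicts[(first, second)] = shared
    return conflicts
```

Each returned pair is a candidate for tighter descriptions or negative triggers on one side.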

Should I include code examples in SKILL.md?

Yes. Concrete examples reduce ambiguity more than prose instructions. Include 2-3 examples showing the mapping from user query to expected behavior, plus at least one negative example showing what the skill should reject. The agent uses these examples for in-context reasoning about edge cases.

How do I know if my skill actually triggers?

Without monitoring, you don't. Skill failures are silent by default — no error, no log, no notification. You can manually test by describing the task in varied ways and checking the agent's response for skill invocation signals. For automated monitoring, run npx selftune@latest doctor to check description coverage and identify structural gaps, or use selftune watch to monitor trigger rates from real sessions over time.