What Are Agent Skills? A Developer's Guide
If you use an AI coding agent — Claude Code, Codex, Copilot, or any of the open-source alternatives — you've probably encountered agent skills already, whether you realized it or not. But what are agent skills, exactly? And why do they matter for how you work with AI?
Agent skills are reusable instruction sets, typically defined in SKILL.md files, that extend AI coding agents with specialized capabilities. When a skill is installed, the agent reads its instructions, matches incoming user queries against the skill's trigger conditions, and executes the skill's defined workflow when a match occurs. Skills let agents do things they couldn't do out of the box — generate project-specific boilerplate, enforce team coding standards, run multi-step deployment workflows, or apply domain expertise the base model doesn't have.
Think of skills as the agent equivalent of shell aliases, IDE snippets, and CI scripts combined into a single portable format. They encode how to do something, not just what to do, and they travel with your project.
What Are Agent Skills and How Do They Work
Agent skills emerged in late 2025 when Anthropic introduced the SKILL.md standard for Claude Code. The idea was straightforward: give developers a way to teach their agent repeatable workflows without re-explaining them every session. Other platforms adopted the concept quickly. As of early 2026, Codex, OpenCode, Copilot, and several open-source agents all support some form of skill definition. As of March 2026, over 400K skills exist across public registries and private repositories.
The lifecycle of a skill invocation follows a consistent pattern across platforms:
- Discovery. The agent scans for SKILL.md files in the project directory, user home directory, or a configured skills registry.
- Matching. When a user sends a query, the agent compares it against each skill's name, description, and trigger conditions. If the query matches, the skill activates.
- Execution. The agent follows the skill's instructions — a sequence of steps that may include reading files, running commands, calling APIs, or generating code.
- Output. The agent returns results to the user, formatted according to the skill's output specification.
This happens transparently. The user doesn't have to invoke a skill by name (though they can). The agent decides which skill to trigger based on the query's intent and the skill's description.
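The discovery step in the lifecycle above can be sketched in a few lines of Python. This is an illustration only, not how any particular agent implements it: the function name `discover_skill_files` is hypothetical, and the choice of search roots is an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def discover_skill_files(roots: Iterable[Path]) -> List[Path]:
    """Return every SKILL.md found under the given root directories, sorted."""
    found = set()
    for root in roots:
        # Skip roots that do not exist, e.g. no user-level skills directory.
        if root.is_dir():
            # rglob walks the tree recursively and matches the exact filename.
            found.update(p.resolve() for p in root.rglob("SKILL.md"))
    return sorted(found)


if __name__ == "__main__":
    # Assumed search roots: project-level and user-level skills directories.
    roots = [Path.cwd() / ".claude" / "skills", Path.home() / ".claude" / "skills"]
    for path in discover_skill_files(roots):
        print(path)
```

A real agent would go on to read each file's name and description and keep them available for the matching step.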
The SKILL.md Standard Explained
A SKILL.md file is a markdown document that defines a single skill. The format is intentionally simple — it's designed to be readable by both humans and agents. Here's a minimal example:
    # Deploy to Staging
    ## Description
    Deploys the current branch to the staging environment using
    the team's standard deployment pipeline.
    ## Trigger
    User asks to deploy, push to staging, or test in staging environment.
    ## Instructions
    1. Run `git status` to confirm clean working tree
    2. Read `.deploy/staging.yml` for environment config
    3. Execute `./scripts/deploy.sh --env staging --branch $(git branch --show-current)`
    4. Verify deployment by checking health endpoint at `$STAGING_URL/health`
    5. Report deployment status with commit SHA and environment URL
The key fields:
- Name (the H1 heading): Human-readable identifier.
- Description: What the skill does. Agents use this for matching — a vague description means missed triggers.
- Trigger: Conditions under which the skill should activate. Some formats use explicit trigger phrases; others rely on the description alone.
- Instructions: The step-by-step workflow the agent follows.
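To make the field layout concrete, here is a minimal sketch of a parser for the heading-based layout used in the example. It is an assumption-laden illustration: the `Skill` class and `parse_skill` function are hypothetical names, and a format that keeps its metadata somewhere other than headings (YAML frontmatter, for instance) would need a different parser.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Skill:
    name: str
    sections: Dict[str, str] = field(default_factory=dict)


def parse_skill(text: str) -> Skill:
    """Split a heading-based SKILL.md into its name and '## ' sections."""
    name = ""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if line.startswith("# ") and not name:
            # The first H1 heading is the skill's name.
            name = line[2:].strip()
        elif line.startswith("## "):
            # Each H2 heading opens a new section, keyed in lowercase.
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return Skill(name, {k: "\n".join(v).strip() for k, v in sections.items()})


example = """# Deploy to Staging
## Description
Deploys the current branch to the staging environment.
## Trigger
User asks to deploy, push to staging, or test in staging environment.
## Instructions
1. Run `git status` to confirm clean working tree
"""
skill = parse_skill(example)
print(skill.name)              # Deploy to Staging
print(sorted(skill.sections))  # ['description', 'instructions', 'trigger']
```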
Why does a standard matter? Before SKILL.md, developers encoded agent behaviors in system prompts, CLAUDE.md files, or ad-hoc markdown documents. These worked but weren't portable. A skill written for Claude Code couldn't run in Codex. The SKILL.md standard creates interoperability — write once, run across any agent that supports the format.
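That said, the standard is still evolving. Anthropic's Claude Code documentation defines the canonical format, which keeps a skill's name and description in YAML frontmatter at the top of the file, so read the heading-based example above as a simplified illustration. Implementations also vary across platforms: OpenAI's Codex agent documentation describes a similar concept with slightly different conventions.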
Types of Agent Skills
Skills fall into four broad categories based on what they do:
| Category | What It Does | Example |
|---|---|---|
| Workflow skills | Automate multi-step processes | Deploy to staging, run full test suite, create PR with changelog |
| Tool integration skills | Connect agent to external services | Post to Slack, create Linear ticket, query Datadog |
| Knowledge skills | Inject domain expertise | Company style guide, API design standards, regulatory compliance rules |
| Analysis skills | Evaluate code or data | Security audit, performance review, accessibility check, SEO audit |
Workflow skills are the most common. They encode sequences of actions your team repeats — deployments, release processes, environment setup. Without a skill, you'd re-explain the process to the agent each time.
Tool integration skills bridge the gap between the agent and external systems. While MCP (Model Context Protocol) provides a protocol-level interface to external tools, integration skills operate at a higher level — they define when and how to use those tools in context.
Knowledge skills are underappreciated. They don't automate actions; they inject context. A knowledge skill might contain your team's API naming conventions, your company's data handling policies, or domain-specific terminology. The agent consults these when making decisions.
Analysis skills apply structured evaluation criteria to code or output. A security audit skill, for example, might define a checklist of vulnerabilities to scan for, the order to check them, and the format to report findings.
Most real-world skills blend categories. A "create PR" skill might combine workflow (branch, commit, push, create PR), knowledge (team PR template, required reviewers), and analysis (run linter, check test coverage) into a single invocation.
How Agent Skills Trigger (and Why They Sometimes Don't)
The triggering mechanism is the most important — and most fragile — part of the skill system.
When a user sends a query, the agent evaluates it against every installed skill's description and trigger conditions. This evaluation is probabilistic, not deterministic. The agent uses the skill's description text to decide relevance. If the description closely matches the user's intent, the skill fires. If not, it doesn't.
This creates a specific failure mode: silent misses. When a skill fails to trigger, there's no error message, no log entry, no signal at all. The agent simply proceeds without the skill, often producing a generic response instead of the specialized one the skill would have generated.
Consider a skill with the description "Generates a PowerPoint presentation from markdown input." A user asks: "Can you make me some slides for the Q3 board meeting?" The intent maps to the skill, but the description doesn't mention "slides" or "board meeting." The skill doesn't fire. The user gets a generic response. Neither the user nor the skill author knows the skill was relevant.
This is the false negative problem: the skill should have fired but didn't, and nobody noticed. The inverse, a false positive (the skill fires when it shouldn't), is less common but equally damaging.
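The vocabulary gap in the slides example can be shown with a deliberately crude stand-in. Real agents decide relevance with the model's own judgment, not keyword overlap, so treat the scoring function below as an assumption made purely for illustration. The stopword list and the broadened description are also invented for this sketch.

```python
import re
from typing import Set

# Small, hand-picked filler-word list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "from", "for", "me", "you", "can", "some", "to", "of"}


def keywords(text: str) -> Set[str]:
    """Lowercase word tokens with common filler words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def overlap_score(query: str, description: str) -> float:
    """Fraction of the query's keywords that also appear in the description."""
    q = keywords(query)
    if not q:
        return 0.0
    return len(q & keywords(description)) / len(q)


description = "Generates a PowerPoint presentation from markdown input."
query = "Can you make me some slides for the Q3 board meeting?"
print(overlap_score(query, description))  # 0.0, no shared keywords

# Hypothetical rewrite that names the words users actually say.
broader = description + " Use for slides, decks, and meeting presentations."
print(overlap_score(query, broader))  # 0.4
```

The point is not the scoring method but the shape of the failure: the original description shares no vocabulary with a perfectly reasonable request, and adding the words users reach for closes part of the gap.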
The concept of skill observability addresses this gap: monitoring whether skills actually trigger correctly against real user queries, measuring trigger accuracy, and surfacing missed invocations. Without observability, skill quality degrades silently over time as user language drifts from the original description's assumptions.
Agent Skills vs Plugins vs Extensions vs MCP Tools
Agent skills are often confused with browser plugins, IDE extensions, and MCP tools. They share surface-level similarities — all extend a system's capabilities — but they operate at different layers and solve different problems.
| Feature | Agent Skills (SKILL.md) | Browser Plugins | IDE Extensions | MCP Tools |
|---|---|---|---|---|
| Operates at | Agent layer | Browser layer | IDE layer | Protocol layer |
| Defined in | Markdown (SKILL.md) | JavaScript/manifest | TypeScript/JSON | JSON-RPC server |
| Triggered by | Natural language intent | User click or page event | User action or command | Agent function call |
| Installed via | File in project or registry | Browser store | VS Code marketplace | MCP server config |
| Portable across platforms | Largely (SKILL.md standard; conventions vary) | Browser-specific | IDE-specific | Agent-agnostic |
| Composable | Yes (skills can invoke other skills) | Limited | Limited | Yes (tools compose) |
| Requires code | No (markdown only) | Yes | Yes | Yes |
| User interaction model | Conversational | GUI | GUI/Command palette | Programmatic |
The key distinction: skills are instruction-based and agent-native. They don't require code — a product manager can write a skill in markdown. They trigger from natural language, not clicks or commands. And they compose: a deployment skill can invoke a testing skill, which invokes a notification skill.
MCP tools and agent skills are complementary, not competing. MCP defines what tools exist (read a file, query a database, send a message). Skills define when and how to use them (when the user asks to deploy, read the config, run the deploy script, then send a Slack notification). MCP is the hands; skills are the playbook.
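The division of labor can be sketched as a registry of tools plus an ordered playbook. Everything here is hypothetical: the tool names, the `run_playbook` helper, and the context keys are invented for illustration, and the lambdas merely stand in for calls that a real setup would route to MCP servers.

```python
from typing import Callable, Dict, List

Tool = Callable[[Dict[str, str]], str]

# Hypothetical tools: the "hands". Each one does a single thing.
TOOLS: Dict[str, Tool] = {
    "read_config": lambda ctx: f"read config for {ctx['env']}",
    "run_deploy": lambda ctx: f"deployed {ctx['branch']} to {ctx['env']}",
    "notify_slack": lambda ctx: f"notified #{ctx['channel']}",
}

# The skill is the "playbook": it only says which tools to use and in what order.
DEPLOY_PLAYBOOK: List[str] = ["read_config", "run_deploy", "notify_slack"]


def run_playbook(steps: List[str], ctx: Dict[str, str]) -> List[str]:
    """Run each named tool in order and collect what it reports."""
    return [TOOLS[step](ctx) for step in steps]


context = {"env": "staging", "branch": "main", "channel": "deploys"}
for line in run_playbook(DEPLOY_PLAYBOOK, context):
    print(line)
# read config for staging
# deployed main to staging
# notified #deploys
```

In practice the playbook is prose in a SKILL.md file and the agent interprets it, rather than a fixed list executed mechanically.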
How to Evaluate Agent Skill Quality
Skill quality has three measurable dimensions:
Trigger rate. Does the skill fire when it should? Measure this by running a diverse set of queries that should invoke the skill and counting how many actually do. A good trigger rate is above 90%. Most untested skills land between 50% and 70%.
False positive rate. Does the skill fire when it shouldn't? Run queries unrelated to the skill and check for unwanted activations. False positives erode trust — users learn to distrust skills that activate incorrectly.
Execution quality. When the skill fires, does it produce good results? This is harder to measure automatically but can be graded against expected outputs for known inputs.
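The first two dimensions reduce to simple ratios over a labeled set of queries. The sketch below assumes you already have, for each query, a label saying whether the skill should fire and an observation of whether it did. The `EvalCase` structure and the sample queries are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class EvalCase:
    query: str
    should_trigger: bool  # label assigned by the skill author
    did_trigger: bool     # what the agent actually did


def trigger_metrics(cases: Iterable[EvalCase]) -> Tuple[float, float]:
    """Return (trigger_rate, false_positive_rate) for one skill."""
    cases = list(cases)
    positives = [c for c in cases if c.should_trigger]
    negatives = [c for c in cases if not c.should_trigger]
    # Trigger rate: share of should-fire queries where the skill fired.
    trigger_rate = (
        sum(c.did_trigger for c in positives) / len(positives) if positives else 0.0
    )
    # False positive rate: share of should-not-fire queries where it fired anyway.
    false_positive_rate = (
        sum(c.did_trigger for c in negatives) / len(negatives) if negatives else 0.0
    )
    return trigger_rate, false_positive_rate


cases = [
    EvalCase("deploy this branch to staging", True, True),
    EvalCase("push my changes to the staging env", True, True),
    EvalCase("can I test this in staging?", True, False),
    EvalCase("ship it to staging please", True, True),
    EvalCase("explain this regex", False, False),
    EvalCase("rename this variable", False, True),
]
rate, fp_rate = trigger_metrics(cases)
print(f"trigger rate {rate:.0%}, false positive rate {fp_rate:.0%}")
# trigger rate 75%, false positive rate 50%
```

Six queries is far too few for a real evaluation; the numbers only show how the two rates are computed.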
Tools like selftune detect missed triggers and grade execution quality using real session data. Rather than relying on synthetic test prompts that authors write, selftune monitors actual user sessions, surfaces queries where a skill should have fired but didn't, and proposes improved descriptions. It validates changes against eval sets before deploying them and rolls back automatically if quality drops.
The difference between testing and monitoring matters here. Testing catches failures you can predict. Monitoring catches failures you can't — the gap between how you phrase test prompts and how your users actually talk.
To run a quick health check on your installed skills:
    npx selftune@latest doctor
This takes about two minutes, requires no API keys, and produces a health report for every skill — which ones trigger correctly, which ones miss, and what to fix.
Frequently Asked Questions About Agent Skills
How many agent skills exist?
Over 400K skills exist across public registries, GitHub repositories, and private installations as of early 2026. The number roughly doubled between December 2025 and March 2026 as more platforms adopted the SKILL.md standard.
Do I need to install agent skills?
Yes, but installation is usually lightweight. For Claude Code, you add a skill folder containing a SKILL.md file to your project's .claude/skills/ directory; other agents use an equivalent path. Public skills can be installed via package managers (npx skills add <skill-name> is a common pattern). Some agents also support skill registries that sync automatically.
Can skills conflict with each other?
Yes. Skill conflicts are a known issue. When two skills have overlapping trigger conditions (for example, a "create PR" skill and a "code review" skill that both match on "review my changes"), the agent must choose one. The selection is nondeterministic and varies between sessions. Conflict detection is an active area of tooling development.
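A first-pass conflict check can be as simple as looking for trigger phrases that two skills share. This sketch assumes each skill lists explicit trigger phrases, which not every format does, and it only catches literal overlap, not two differently worded triggers with the same intent. The function name and the sample skills are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_trigger_conflicts(
    triggers: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Return (skill_a, skill_b, shared_phrases) for every overlapping pair."""
    conflicts = []
    # Sort names so the output order is stable between runs.
    for a, b in combinations(sorted(triggers), 2):
        shared = {p.lower() for p in triggers[a]} & {p.lower() for p in triggers[b]}
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


triggers = {
    "create-pr": {"open a pull request", "review my changes"},
    "code-review": {"Review my changes", "audit this diff"},
    "deploy-staging": {"deploy to staging"},
}
for a, b, shared in find_trigger_conflicts(triggers):
    print(f"{a} and {b} both match: {sorted(shared)}")
# code-review and create-pr both match: ['review my changes']
```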
How do I know if my skills are working?
Without observability tools, you don't. Skills fail silently: no error, no log, no signal. You might notice the agent giving generic responses where you expected specialized ones, but you won't know which skill missed or why. Runtime monitoring tools surface these failures by comparing user queries against installed skill descriptions and flagging mismatches.
selftune is open source (MIT).
Related reading:
- Anthropic: Claude Code Skills Documentation
- Anthropic: Demystifying evals for AI agents
- Model Context Protocol