Intent Classification
The mechanism by which a system determines what type of request it is handling — inspecting user input and assigning it to a category before any generation or action begins.
What is it?
Every system that handles more than one type of task must answer a question before it can do anything useful: what does the user actually want? Intent classification is the process of answering that question. It takes raw user input — a message, a query, a command — and assigns it to a predefined category (an “intent”) that determines what happens next.1
The parent concept, prompt-routing, introduces the idea that routing is a separate step from generation: the router decides who answers the question, not what the answer is. Intent classification is the engine inside that router. It is the specific mechanism that inspects the input, interprets what the user is asking for, and produces a label that the routing table uses to select the right handler.2
Intent classification is not a new idea. It predates large language models entirely. Traditional chatbot frameworks like Google’s Dialogflow and IBM Watson have used intent classification since the mid-2010s, matching user utterances to predefined intents using a combination of keyword rules, machine learning models, and training examples.3 What has changed is the sophistication of the classifiers available: modern systems can use LLMs themselves as classifiers, handling ambiguous, multilingual, and context-dependent input that would defeat older approaches.
The critical design decision in intent classification is not whether to classify — it is how. Three distinct tiers of classification exist, each with different trade-offs in speed, cost, flexibility, and reliability. Most production systems stack multiple tiers, using fast cheap methods first and escalating to expensive flexible methods only when needed.4
In plain terms
Intent classification is like a triage nurse in a hospital emergency room. The nurse does not treat anyone — they listen to each patient’s description of what is wrong, decide which category it falls into (cardiac, trauma, minor injury, psychiatric), and send the patient to the right department. The quality of the triage determines whether patients get the right treatment quickly.
At a glance
Three tiers of intent classification (click to expand)
```mermaid
graph TD
    INPUT[User Input] --> T1[Tier 1 - Rule-Based]
    T1 -->|match| HANDLER[Route to Handler]
    T1 -->|no match| T2[Tier 2 - Semantic]
    T2 -->|confident| HANDLER
    T2 -->|uncertain| T3[Tier 3 - LLM Classifier]
    T3 -->|classified| HANDLER
    T3 -->|low confidence| FB[Fallback - Ask User]
    FB -->|clarified| INPUT
```

Key: Input enters at the cheapest, fastest tier. If that tier cannot classify confidently, it escalates to the next. Each tier is more capable but more expensive. The fallback path handles cases where even the LLM classifier is uncertain. This multi-tier stacking pattern keeps costs low while maintaining high classification accuracy.4
How does it work?
Intent classification operates through three tiers of increasing sophistication. Understanding each tier — and when to use it — is the core skill in designing a classification system.
1. Rule-based classification — keywords, regex, exact matches
The simplest tier uses deterministic rules: if the input contains “cancel” and “subscription”, classify as cancellation. If it starts with “/help”, classify as help_request. Regular expressions, keyword lists, and pattern matching are the tools here.1
Rule-based classification is fast (microseconds), free (no API calls), predictable (the same input always produces the same output), and easy to debug (you can read the rules and understand every decision). These properties make it the ideal first pass in any classification system.
The limitation is rigidity. Rules fail on paraphrases (“I want to stop paying” vs “cancel my subscription”), multilingual input, typos, and any phrasing the rule author did not anticipate. As the number of intents grows, the rule set becomes brittle and increasingly difficult to maintain without conflicts.5
For example:
```python
def rule_classify(message):
    msg = message.lower()
    if any(kw in msg for kw in ["cancel", "unsubscribe", "stop"]):
        return "cancellation"
    if any(kw in msg for kw in ["invoice", "bill", "payment", "charge"]):
        return "billing"
    if any(kw in msg for kw in ["bug", "error", "crash", "broken"]):
        return "technical"
    return None  # escalate to next tier
```

Think of it like...
A vending machine with labelled buttons. Press B3 and you always get the same snack. Fast, reliable, zero ambiguity — but if what you want is not on a button, the machine cannot help you.
2. Semantic classification — embedding similarity
The second tier uses vector embeddings to compare the user’s input against descriptions of each intent. Both the input and the intent descriptions are converted into numerical vectors (embeddings), and the system selects the intent whose description is most similar to the input in vector space.4
This approach handles paraphrases naturally. “I want to stop paying” and “cancel my subscription” produce similar embeddings even though they share no keywords. It also handles multilingual input if the embedding model supports multiple languages.
The process works as follows:
- Define each intent with a description, e.g. `cancellation = "User wants to cancel, end, or stop their subscription or service"`
- Pre-compute an embedding vector for each intent description
- When input arrives, compute its embedding
- Compare the input embedding against all intent embeddings using cosine similarity
- Select the intent with the highest similarity score, provided it exceeds a confidence threshold
The confidence threshold is critical. If the best match scores 0.92 out of 1.0, the classification is confident. If it scores 0.61, the input is ambiguous and should be escalated to the next tier rather than routed to a potentially wrong handler.5
Think of it like...
A librarian who understands what you mean, not just what you say. You ask for “a book about growing tomatoes” and the librarian finds it in the gardening section — even though no book is titled exactly that. The librarian matches meaning, not keywords.
Example: embedding-based intent matching (click to expand)
Consider a system with four intents and their descriptions:
| Intent | Description |
|---|---|
| billing | "Questions about invoices, charges, payments, or pricing" |
| technical | "Bug reports, errors, crashes, or technical problems" |
| account | "Account settings, password changes, profile updates" |
| general | "General questions, feedback, or other topics" |

User input: "Why was I charged twice this month?"

| Intent | Cosine similarity |
|---|---|
| billing | 0.89 |
| account | 0.52 |
| technical | 0.31 |
| general | 0.28 |

The billing intent wins with high confidence. The system routes to the billing handler without needing an LLM call.
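The worked example above can be sketched in code. This is a minimal illustration rather than a production implementation: a real system would obtain the vectors from an embedding model, whereas here pre-computed vectors are passed in directly, and the 0.75 threshold is an assumed placeholder to be tuned against real traffic.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length numeric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_classify(input_vec, intent_vecs, threshold=0.75):
    """Return (intent, score) for the best-matching intent description,
    or (None, score) when the best score falls below the threshold."""
    scores = {name: cosine_similarity(input_vec, vec)
              for name, vec in intent_vecs.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]  # ambiguous: escalate to the next tier
```

Returning the score alongside the label (even on failure) matters: the next tier, and any audit log, can see how close the call was.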
3. LLM-based classification — the model as classifier
The most flexible tier uses a language model to classify the input. The classifier prompt describes the available intents and asks the model to select the best match, typically returning a structured label rather than free text.2
For example:
```text
System: You are an intent classifier. Classify the user
message into exactly one category. Return ONLY the
category name.

Categories:
- billing: questions about charges, invoices, payments
- technical: bug reports, errors, system issues
- account: profile settings, passwords, preferences
- cancellation: stopping or ending a subscription
- general: everything else

User: "Je voudrais annuler mon abonnement"

Response: cancellation
```
The LLM-based classifier handles the French input (“I would like to cancel my subscription”) correctly — something neither keyword rules nor simple embeddings would manage reliably without multilingual support. It also handles ambiguous, multi-intent, and context-dependent input far better than the other tiers.2
The trade-offs are cost (an LLM API call for every classification), latency (hundreds of milliseconds vs microseconds for rules), and non-determinism (the same input might occasionally be classified differently on different runs). Production systems mitigate these by using a small, fast model (e.g., GPT-4o-mini) for classification rather than a large, expensive one.1
Key distinction
Few-shot classification improves LLM-based classifiers significantly. Instead of only describing the categories, you include 2-3 examples of each category in the prompt. The model uses these examples as reference points, producing more consistent classifications — especially for edge cases and ambiguous input.1
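A sketch of how such a few-shot classifier prompt might be assembled and its reply validated. The `FEW_SHOT_EXAMPLES`, the prompt wording, and the helper names are illustrative assumptions; the actual model call is omitted, since any chat-completion API would work here, and the key point is constraining the reply to the taxonomy.

```python
VALID_INTENTS = {"billing", "technical", "account", "cancellation", "general"}

# Hypothetical few-shot examples: 2-3 per category in a real system.
FEW_SHOT_EXAMPLES = [
    ("Why was I charged twice?", "billing"),
    ("The app crashes on startup", "technical"),
    ("I want to end my subscription", "cancellation"),
]

def build_classifier_prompt(message):
    """Assemble a few-shot classification prompt over the taxonomy."""
    lines = [
        "You are an intent classifier. Classify the user message",
        "into exactly one category. Return ONLY the category name.",
        "Categories: " + ", ".join(sorted(VALID_INTENTS)),
        "Examples:",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f'  "{text}" -> {label}')
    lines.append(f'Message: "{message}"')
    return "\n".join(lines)

def parse_label(raw):
    # Normalise the model's reply and reject anything outside the taxonomy,
    # so a hallucinated category falls through to the fallback path.
    label = raw.strip().lower()
    return label if label in VALID_INTENTS else None
```

Validating the reply against the closed set of labels is what turns a free-text model into a classifier: anything else is treated as "not classified", never as a new route.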
4. Designing an intent taxonomy
The quality of classification depends heavily on the quality of the intent definitions. A well-designed intent taxonomy has these properties:5
- Mutually exclusive — each input should clearly belong to one category, not several
- Collectively exhaustive — every possible input should match at least one category (use a “general” or “other” catch-all)
- Appropriately granular — too few categories force dissimilar requests into the same handler; too many create overlaps and confusion
- Stable — categories should not change frequently, though new ones can be added
A common mistake is designing categories based on internal system architecture rather than user intent. Users do not think in terms of your microservices; they think in terms of what they want to accomplish. “Change my address” and “change my password” are different intents even though both touch the “account” service.
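Two of these properties can be checked mechanically; granularity and stability still require human judgement. A minimal sketch, assuming intents are represented simply as a list of label strings:

```python
def validate_taxonomy(labels, catch_all="general"):
    """Mechanical checks on an intent taxonomy: duplicate labels hint at
    overlapping categories, and a missing catch-all breaks collective
    exhaustiveness. Returns a list of problem descriptions (empty = OK)."""
    problems = []
    if len(labels) != len(set(labels)):
        problems.append("duplicate intent labels")
    if catch_all not in labels:
        problems.append(f"missing catch-all intent '{catch_all}'")
    return problems
```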
5. Confidence thresholds and fallback handling
Every classification system produces uncertain results. The mechanism for handling uncertainty is what separates a frustrating system from an intelligent one.4
Confidence thresholds define the minimum certainty required before acting on a classification. If a semantic classifier returns a top score of 0.55, the system should not route confidently — it should either escalate to a higher tier or ask the user for clarification.
Fallback strategies determine what happens when confidence is low:
- Clarification prompts — present the user with the top 2-3 candidate intents and let them choose: “Did you mean billing, account, or something else?”
- Default handler — route to a general-purpose agent that can handle broad queries
- Escalation — pass to a human operator when automated classification fails
- Multi-label routing — when an input legitimately spans multiple intents, run multiple handlers in parallel
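The threshold-and-fallback logic above can be sketched as a small dispatcher. The `high` and `low` thresholds (0.8 and 0.5) are illustrative assumptions, as is the three-way action split; real systems tune these values against traffic and may add escalation or multi-label paths.

```python
def handle_classification(intent, confidence, candidates,
                          high=0.8, low=0.5):
    """Map a classification result to an action using confidence
    thresholds. Returns an (action, payload) pair."""
    if intent and confidence >= high:
        # Confident: route directly to the matched handler.
        return ("route", intent)
    if confidence >= low and candidates:
        # Ambiguous: ask the user to pick between the top candidates.
        return ("clarify", candidates[:3])
    # Low confidence: fall back to a general-purpose handler.
    return ("default_handler", None)
```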
Yiuno example: AGENTS.md as an intent taxonomy (click to expand)
The yiuno vault’s AGENTS.md file defines an intent taxonomy for the Claude Code agent:
| User intent | Routed to |
|---|---|
| Create a concept card | _ai/playbooks/concept-card.md |
| Create a learning path | _ai/playbooks/learning-path.md |
| Publish an article | _ai/playbooks/publish.md |
| Deploy the site | _ai/playbooks/deploy.md |

This is tier-1 (rule-based) classification: the agent matches the user's first message against the routing table. The categories are mutually exclusive, collectively exhaustive (with a fallback menu), and stable. When a new capability is added, a new row is added to the table; the existing routes remain unchanged.
6. The multi-tier stacking pattern
The most robust production systems combine all three tiers into a cascade, processing input through increasingly capable (and expensive) classifiers until confident classification is achieved. This pattern is widely used in enterprise systems.4
| Tier | Method | Speed | Cost | Handles ambiguity | When to use |
|---|---|---|---|---|---|
| 1 | Rule-based | Microseconds | Free | No | Exact commands, structured input, slash commands |
| 2 | Semantic | Milliseconds | Low | Moderate | Natural language with clear intent |
| 3 | LLM | Hundreds of ms | High | Yes | Ambiguous, multilingual, complex input |
| Fallback | User clarification | Seconds | None | Yes | All tiers uncertain |
The key insight: most inputs in a well-designed system are handled at tier 1 or 2. The expensive LLM classifier only fires for the 10-20% of inputs that are genuinely ambiguous. This keeps average classification cost low while maintaining high accuracy across the full range of inputs.4
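The cascade can be sketched as a loop over tiers, each a callable that returns an intent label or `None` to escalate. The tier implementations passed in are assumptions standing in for the real classifiers described above.

```python
def classify(message, rule_tier, semantic_tier, llm_tier):
    """Run the three-tier cascade: try the cheapest classifier first
    and escalate only when a tier cannot classify confidently.
    Returns (intent, tier_name) so routing decisions stay auditable."""
    tiers = [("rules", rule_tier),
             ("semantic", semantic_tier),
             ("llm", llm_tier)]
    for tier_name, tier in tiers:
        label = tier(message)
        if label is not None:
            return label, tier_name
    return None, "fallback"  # all tiers uncertain: ask the user
```

Returning the tier name alongside the label gives the audit trail described below: every routing decision records not just where the input went, but which classifier made the call.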
Why do we use it?
Key reasons
1. Specialisation enables quality. Each handler can be optimised for a narrow task with focused instructions, relevant context, and appropriate tools. Classification is what makes this specialisation possible — without it, every handler must be a generalist.2
2. Cost efficiency at scale. By classifying input before processing it, the system can match model capability to task complexity. Simple queries go to small, fast, cheap models; complex ones go to large, capable, expensive models. Classification is the gatekeeper that makes this economic optimisation work.4
3. Debuggability and accountability. When something goes wrong, the classification decision is an inspectable artifact. You can see what the input was, how it was classified, and which handler processed it. This makes errors traceable and fixable rather than mysterious.1
4. Graceful degradation. A system with confidence thresholds and fallback handling degrades gracefully when it encounters unexpected input. Instead of routing to the wrong handler and producing a bad result, it asks for clarification or routes to a general handler — the user experience remains acceptable even in failure modes.5
When do we use it?
- When a system handles multiple types of requests that require different handlers
- When building multi-agent systems where specialised agents handle different domains
- When cost optimisation matters and different request types justify different model sizes
- When the system receives input in multiple languages or highly variable phrasing
- When you need auditability — a record of why each request was handled the way it was
- When user experience depends on fast, accurate routing to the right capability
Rule of thumb
If your system has more than three distinct types of requests, build an explicit intent classifier rather than relying on a single prompt with conditional logic. The classifier scales; the conditional prompt does not.
How can I think about it?
The airport check-in analogy
Intent classification works like the check-in process at an airport.
- Passengers arrive with different needs: domestic flights, international flights, first class, special assistance
- The signage and kiosks (tier 1 - rules) handle the obvious cases: scan your boarding pass, get directed to the right terminal
- The check-in agent (tier 2 - semantic) handles passengers who do not fit the simple categories: “I have a connecting flight through Zurich” — the agent understands the intent and routes accordingly
- The supervisor (tier 3 - LLM) handles the truly complex cases: “I missed my flight, I have a visa issue, and my luggage is in another country”
- If nobody can help, the information desk (fallback) asks clarifying questions and finds the right person
Most passengers (80%+) are handled by signage and kiosks. The expensive human agents only handle the edge cases. The airport is designed so that the common path is fast and cheap, and the complex path is thorough and flexible.
The mail room analogy
Intent classification works like a corporate mail room sorting incoming correspondence.
- Every piece of mail arrives at the central mail room (system entry point)
- Machine-readable mail (tier 1): envelopes with department codes or standard forms are sorted automatically by machine. Fast, zero ambiguity, handles 60% of volume
- Handwritten addresses (tier 2): a sorting clerk reads the address and matches it to the right department. Handles variation in handwriting but occasionally misreads
- Unclear or damaged mail (tier 3): a senior clerk opens the envelope, reads the content, and determines where it should go. Slow but handles anything
- Truly unidentifiable mail (fallback): returned to sender with a note asking for clarification
The mail room is efficient because it uses the cheapest method that works for each piece of mail, not the most expensive method for all mail. Intent classification follows the same economic logic.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| prompt-routing | The broader routing framework that uses intent classification as its engine | complete |
| context-cascading | How classification results determine which context layers are loaded | complete |
| autonomy-spectrum | How classification confidence affects whether the system acts autonomously or asks for human input | complete |
| guardrails | Constraints that prevent misclassification from causing harmful actions | complete |
| embeddings | The vector representations that power semantic classification | complete |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain what intent classification does and why it is a separate step from the actual task execution. What problem does separating classification from generation solve?
- Name the three tiers of intent classification and describe one advantage and one limitation of each.
- Distinguish between semantic classification (embeddings) and LLM-based classification. In what situations would you prefer one over the other?
- Interpret this scenario: a customer writes “I was charged $50 but my account says I cancelled last week.” This input matches both the “billing” and “cancellation” intents. How should a well-designed classification system handle this?
- Connect intent classification to confidence thresholds. Why is a binary “classified / not classified” result insufficient, and how do confidence scores enable the multi-tier stacking pattern?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    PR --> IC[Intent Classification]
    PR --> PBP[Playbooks as Programs]
    PR --> ORCH[Orchestration]
    CC[Context Cascading] -.->|prerequisite| PR
    IC -.->|related| AS[Autonomy Spectrum]
    IC -.->|related| GR[Guardrails]
    style IC fill:#4a9ede,color:#fff
```

Related concepts:
- autonomy-spectrum — classification confidence determines where on the autonomy spectrum the system operates: high confidence enables autonomous action, low confidence triggers human involvement
- guardrails — guardrails constrain what the system can do after classification; intent classification constrains where the input goes
- context-cascading — once intent is classified, the routing decision determines which context layers are loaded for the downstream handler
Sources
Further reading
Resources
- AI Agent Routing: A Practical Guide to Intent Classification (BSWEN) — Hands-on implementation guide with Python code for rule-based, semantic, and LLM-based classification
- LLM Routing for AI Agents: The Complete Architecture Guide (ClawRouters) — Deep technical dive into multi-tier routing architecture with latency, cost, and accuracy trade-offs
- Building Effective Agents (Anthropic) — Foundational reference on agent workflow patterns including routing and classification
- AI Agent Routing: Everything You Need to Know (Zencoder) — Comprehensive overview of routing approaches, from rule-based to hierarchical multi-agent systems
- Level 2 Agents: Router Pattern Deep Dive (AI Skill Market) — Progressive walkthrough of the router pattern with increasing complexity levels
Footnotes
1. BSWEN. (2026). AI Agent Routing: A Practical Guide to Intent Classification and Routing Implementation. BSWEN.
2. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.
3. Google Cloud. (2025). Generative Fallback — Dialogflow CX. Google Cloud.
4. ClawRouters Team. (2026). LLM Routing for AI Agents: The Complete Architecture Guide. ClawRouters.
5. Zencoder. (2026). AI Agent Routing: Everything That You Need to Know. Zencoder.
