Intent Classification
The mechanism by which a system determines what type of request it is handling — inspecting user input and assigning it to a category before any generation or action begins.
What is it?
Every system that handles more than one type of task must answer a question before it can do anything useful: what does the user actually want? Intent classification is the process of answering that question. It takes raw user input — a message, a query, a command — and assigns it to a predefined category (an “intent”) that determines what happens next.1
The parent concept, prompt-routing, introduces the idea that routing is a separate step from generation: the router decides who answers the question, not what the answer is. Intent classification is the engine inside that router. It is the specific mechanism that inspects the input, interprets what the user is asking for, and produces a label that the routing table uses to select the right handler.2
Intent classification is not a new idea. It predates large language models entirely. Traditional chatbot frameworks like Google’s Dialogflow and IBM Watson have used intent classification since the mid-2010s, matching user utterances to predefined intents using a combination of keyword rules, machine learning models, and training examples.3 What has changed is the sophistication of the classifiers available: modern systems can use LLMs themselves as classifiers, handling ambiguous, multilingual, and context-dependent input that would defeat older approaches.
The critical design decision in intent classification is not whether to classify — it is how. Three distinct tiers of classification exist, each with different trade-offs in speed, cost, flexibility, and reliability. Most production systems stack multiple tiers, using fast cheap methods first and escalating to expensive flexible methods only when needed.4
In plain terms
Intent classification is like a triage nurse in a hospital emergency room. The nurse does not treat anyone — they listen to each patient’s description of what is wrong, decide which category it falls into (cardiac, trauma, minor injury, psychiatric), and send the patient to the right department. The quality of the triage determines whether patients get the right treatment quickly.
At a glance
Three tiers of intent classification (click to expand)
```mermaid
graph TD
    INPUT[User Input] --> T1[Tier 1 - Rule-Based]
    T1 -->|match| HANDLER[Route to Handler]
    T1 -->|no match| T2[Tier 2 - Semantic]
    T2 -->|confident| HANDLER
    T2 -->|uncertain| T3[Tier 3 - LLM Classifier]
    T3 -->|classified| HANDLER
    T3 -->|low confidence| FB[Fallback - Ask User]
    FB -->|clarified| INPUT
```

Key: Input enters at the cheapest, fastest tier. If that tier cannot classify confidently, it escalates to the next. Each tier is more capable but more expensive. The fallback path handles cases where even the LLM classifier is uncertain. This multi-tier stacking pattern keeps costs low while maintaining high classification accuracy.4
How does it work?
Intent classification operates through three tiers of increasing sophistication. Understanding each tier — and when to use it — is the core skill in designing a classification system.
1. Rule-based classification — keywords, regex, exact matches
The simplest tier uses deterministic rules: if the input contains “cancel” and “subscription”, classify as cancellation. If it starts with “/help”, classify as help_request. Regular expressions, keyword lists, and pattern matching are the tools here.1
Rule-based classification is fast (microseconds), free (no API calls), predictable (the same input always produces the same output), and easy to debug (you can read the rules and understand every decision). These properties make it the ideal first pass in any classification system.
The limitation is rigidity. Rules fail on paraphrases (“I want to stop paying” vs “cancel my subscription”), multilingual input, typos, and any phrasing the rule author did not anticipate. As the number of intents grows, the rule set becomes brittle and increasingly difficult to maintain without conflicts.5
For example:
```python
def rule_classify(message):
    msg = message.lower()
    if any(kw in msg for kw in ["cancel", "unsubscribe", "stop"]):
        return "cancellation"
    if any(kw in msg for kw in ["invoice", "bill", "payment", "charge"]):
        return "billing"
    if any(kw in msg for kw in ["bug", "error", "crash", "broken"]):
        return "technical"
    return None  # escalate to next tier
```

Think of it like...
A vending machine with labelled buttons. Press B3 and you always get the same snack. Fast, reliable, zero ambiguity — but if what you want is not on a button, the machine cannot help you.
2. Semantic classification — embedding similarity
The second tier uses vector embeddings to compare the user’s input against descriptions of each intent. Both the input and the intent descriptions are converted into numerical vectors (embeddings), and the system selects the intent whose description is most similar to the input in vector space.4
This approach handles paraphrases naturally. “I want to stop paying” and “cancel my subscription” produce similar embeddings even though they share no keywords. It also handles multilingual input if the embedding model supports multiple languages.
The process works as follows:
- Define each intent with a description, e.g. `cancellation = "User wants to cancel, end, or stop their subscription or service"`
- Pre-compute an embedding vector for each intent description
- When input arrives, compute its embedding
- Compare the input embedding against all intent embeddings using cosine similarity
- Select the intent with the highest similarity score, provided it exceeds a confidence threshold
The confidence threshold is critical. If the best match scores 0.92 out of 1.0, the classification is confident. If it scores 0.61, the input is ambiguous and should be escalated to the next tier rather than routed to a potentially wrong handler.5
Think of it like...
A librarian who understands what you mean, not just what you say. You ask for “a book about growing tomatoes” and the librarian finds it in the gardening section — even though no book is titled exactly that. The librarian matches meaning, not keywords.
Example: embedding-based intent matching (click to expand)
Consider a system with four intents and their descriptions:
| Intent | Description |
|---|---|
| billing | "Questions about invoices, charges, payments, or pricing" |
| technical | "Bug reports, errors, crashes, or technical problems" |
| account | "Account settings, password changes, profile updates" |
| general | "General questions, feedback, or other topics" |

User input: "Why was I charged twice this month?"

| Intent | Cosine similarity |
|---|---|
| billing | 0.89 |
| account | 0.52 |
| technical | 0.31 |
| general | 0.28 |

The billing intent wins with high confidence. The system routes to the billing handler without needing an LLM call.
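The worked example above can be sketched in code. This is a minimal illustration rather than a production implementation: a real system would obtain the vectors from an embedding model, whereas here pre-computed vectors are passed in directly, and the 0.75 threshold is an assumed placeholder to be tuned against real traffic.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length numeric vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_classify(input_vec, intent_vecs, threshold=0.75):
    """Return (intent, score) for the best-matching intent description,
    or (None, score) when the best score falls below the threshold."""
    scores = {name: cosine_similarity(input_vec, vec)
              for name, vec in intent_vecs.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]  # ambiguous: escalate to the next tier
```

Returning the score alongside the label (even on failure) matters: the next tier, and any audit log, can see how close the call was.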
3. LLM-based classification — the model as classifier
The most flexible tier uses a language model to classify the input. The classifier prompt describes the available intents and asks the model to select the best match, typically returning a structured label rather than free text.2
For example:
```text
System: You are an intent classifier. Classify the user
message into exactly one category. Return ONLY the
category name.

Categories:
- billing: questions about charges, invoices, payments
- technical: bug reports, errors, system issues
- account: profile settings, passwords, preferences
- cancellation: stopping or ending a subscription
- general: everything else

User: "Je voudrais annuler mon abonnement"

Response: cancellation
```
The LLM-based classifier handles the French input (“I would like to cancel my subscription”) correctly — something neither keyword rules nor simple embeddings would manage reliably without multilingual support. It also handles ambiguous, multi-intent, and context-dependent input far better than the other tiers.2
The trade-offs are cost (an LLM API call for every classification), latency (hundreds of milliseconds vs microseconds for rules), and non-determinism (the same input might occasionally be classified differently on different runs). Production systems mitigate these by using a small, fast model (e.g., GPT-4o-mini) for classification rather than a large, expensive one.1
Key distinction
Few-shot classification improves LLM-based classifiers significantly. Instead of only describing the categories, you include 2-3 examples of each category in the prompt. The model uses these examples as reference points, producing more consistent classifications — especially for edge cases and ambiguous input.1
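A sketch of how such a few-shot classifier prompt might be assembled and its reply validated. The `FEW_SHOT_EXAMPLES`, the prompt wording, and the helper names are illustrative assumptions; the actual model call is omitted, since any chat-completion API would work here, and the key point is constraining the reply to the taxonomy.

```python
VALID_INTENTS = {"billing", "technical", "account", "cancellation", "general"}

# Hypothetical few-shot examples: 2-3 per category in a real system.
FEW_SHOT_EXAMPLES = [
    ("Why was I charged twice?", "billing"),
    ("The app crashes on startup", "technical"),
    ("I want to end my subscription", "cancellation"),
]

def build_classifier_prompt(message):
    """Assemble a few-shot classification prompt over the taxonomy."""
    lines = [
        "You are an intent classifier. Classify the user message",
        "into exactly one category. Return ONLY the category name.",
        "Categories: " + ", ".join(sorted(VALID_INTENTS)),
        "Examples:",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f'  "{text}" -> {label}')
    lines.append(f'Message: "{message}"')
    return "\n".join(lines)

def parse_label(raw):
    # Normalise the model's reply and reject anything outside the taxonomy,
    # so a hallucinated category falls through to the fallback path.
    label = raw.strip().lower()
    return label if label in VALID_INTENTS else None
```

Validating the reply against the closed set of labels is what turns a free-text model into a classifier: anything else is treated as "not classified", never as a new route.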
4. Designing an intent taxonomy
The quality of classification depends heavily on the quality of the intent definitions. A well-designed intent taxonomy has these properties:5
- Mutually exclusive — each input should clearly belong to one category, not several
- Collectively exhaustive — every possible input should match at least one category (use a “general” or “other” catch-all)
- Appropriately granular — too few categories force dissimilar requests into the same handler; too many create overlaps and confusion
- Stable — categories should not change frequently, though new ones can be added
A common mistake is designing categories based on internal system architecture rather than user intent. Users do not think in terms of your microservices; they think in terms of what they want to accomplish. “Change my address” and “change my password” are different intents even though both touch the “account” service.
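Two of these properties can be checked mechanically; granularity and stability still require human judgement. A minimal sketch, assuming intents are represented simply as a list of label strings:

```python
def validate_taxonomy(labels, catch_all="general"):
    """Mechanical checks on an intent taxonomy: duplicate labels hint at
    overlapping categories, and a missing catch-all breaks collective
    exhaustiveness. Returns a list of problem descriptions (empty = OK)."""
    problems = []
    if len(labels) != len(set(labels)):
        problems.append("duplicate intent labels")
    if catch_all not in labels:
        problems.append(f"missing catch-all intent '{catch_all}'")
    return problems
```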
5. Confidence thresholds and fallback handling
Every classification system produces uncertain results. The mechanism for handling uncertainty is what separates a frustrating system from an intelligent one.4
Confidence thresholds define the minimum certainty required before acting on a classification. If a semantic classifier returns a top score of 0.55, the system should not route confidently — it should either escalate to a higher tier or ask the user for clarification.
Fallback strategies determine what happens when confidence is low:
- Clarification prompts — present the user with the top 2-3 candidate intents and let them choose: “Did you mean billing, account, or something else?”
- Default handler — route to a general-purpose agent that can handle broad queries
- Escalation — pass to a human operator when automated classification fails
- Multi-label routing — when an input legitimately spans multiple intents, run multiple handlers in parallel
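The threshold-and-fallback logic above can be sketched as a small dispatcher. The `high` and `low` thresholds (0.8 and 0.5) are illustrative assumptions, as is the three-way action split; real systems tune these values against traffic and may add escalation or multi-label paths.

```python
def handle_classification(intent, confidence, candidates,
                          high=0.8, low=0.5):
    """Map a classification result to an action using confidence
    thresholds. Returns an (action, payload) pair."""
    if intent and confidence >= high:
        # Confident: route directly to the matched handler.
        return ("route", intent)
    if confidence >= low and candidates:
        # Ambiguous: ask the user to pick between the top candidates.
        return ("clarify", candidates[:3])
    # Low confidence: fall back to a general-purpose handler.
    return ("default_handler", None)
```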
Yiuno example: AGENTS.md as an intent taxonomy (click to expand)
The yiuno vault’s AGENTS.md file defines an intent taxonomy for the Claude Code agent:
| User intent | Routed to |
|---|---|
| Create a concept card | _ai/playbooks/concept-card.md |
| Create a learning path | _ai/playbooks/learning-path.md |
| Publish an article | _ai/playbooks/publish.md |
| Deploy the site | _ai/playbooks/deploy.md |

This is tier-1 (rule-based) classification: the agent matches the user's first message against the routing table. The categories are mutually exclusive, collectively exhaustive (with a fallback menu), and stable. When a new capability is added, a new row is added to the table; the existing routes remain unchanged.
6. The multi-tier stacking pattern
The most robust production systems combine all three tiers into a cascade, processing input through increasingly capable (and expensive) classifiers until confident classification is achieved. This pattern is widely used in enterprise systems.4
| Tier | Method | Speed | Cost | Handles ambiguity | When to use |
|---|---|---|---|---|---|
| 1 | Rule-based | Microseconds | Free | No | Exact commands, structured input, slash commands |
| 2 | Semantic | Milliseconds | Low | Moderate | Natural language with clear intent |
| 3 | LLM | Hundreds of ms | High | Yes | Ambiguous, multilingual, complex input |
| Fallback | User clarification | Seconds | None | Yes | All tiers uncertain |
The key insight: most inputs in a well-designed system are handled at tier 1 or 2. The expensive LLM classifier only fires for the 10-20% of inputs that are genuinely ambiguous. This keeps average classification cost low while maintaining high accuracy across the full range of inputs.4
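The cascade can be sketched as a loop over tiers, each a callable that returns an intent label or `None` to escalate. The tier implementations passed in are assumptions standing in for the real classifiers described above.

```python
def classify(message, rule_tier, semantic_tier, llm_tier):
    """Run the three-tier cascade: try the cheapest classifier first
    and escalate only when a tier cannot classify confidently.
    Returns (intent, tier_name) so routing decisions stay auditable."""
    tiers = [("rules", rule_tier),
             ("semantic", semantic_tier),
             ("llm", llm_tier)]
    for tier_name, tier in tiers:
        label = tier(message)
        if label is not None:
            return label, tier_name
    return None, "fallback"  # all tiers uncertain: ask the user
```

Returning the tier name alongside the label gives the audit trail described below: every routing decision records not just where the input went, but which classifier made the call.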
Why do we use it?
Key reasons
1. Specialisation enables quality. Each handler can be optimised for a narrow task with focused instructions, relevant context, and appropriate tools. Classification is what makes this specialisation possible — without it, every handler must be a generalist.2
2. Cost efficiency at scale. By classifying input before processing it, the system can match model capability to task complexity. Simple queries go to small, fast, cheap models; complex ones go to large, capable, expensive models. Classification is the gatekeeper that makes this economic optimisation work.4
3. Debuggability and accountability. When something goes wrong, the classification decision is an inspectable artifact. You can see what the input was, how it was classified, and which handler processed it. This makes errors traceable and fixable rather than mysterious.1
4. Graceful degradation. A system with confidence thresholds and fallback handling degrades gracefully when it encounters unexpected input. Instead of routing to the wrong handler and producing a bad result, it asks for clarification or routes to a general handler — the user experience remains acceptable even in failure modes.5
When do we use it?
- When a system handles multiple types of requests that require different handlers
- When building multi-agent systems where specialised agents handle different domains
- When cost optimisation matters and different request types justify different model sizes
- When the system receives input in multiple languages or highly variable phrasing
- When you need auditability — a record of why each request was handled the way it was
- When user experience depends on fast, accurate routing to the right capability
Rule of thumb
If your system has more than three distinct types of requests, build an explicit intent classifier rather than relying on a single prompt with conditional logic. The classifier scales; the conditional prompt does not.
How can I think about it?
The airport check-in analogy
Intent classification works like the check-in process at an airport.
- Passengers arrive with different needs: domestic flights, international flights, first class, special assistance
- The signage and kiosks (tier 1 - rules) handle the obvious cases: scan your boarding pass, get directed to the right terminal
- The check-in agent (tier 2 - semantic) handles passengers who do not fit the simple categories: “I have a connecting flight through Zurich” — the agent understands the intent and routes accordingly
- The supervisor (tier 3 - LLM) handles the truly complex cases: “I missed my flight, I have a visa issue, and my luggage is in another country”
- If nobody can help, the information desk (fallback) asks clarifying questions and finds the right person
Most passengers (80%+) are handled by signage and kiosks. The expensive human agents only handle the edge cases. The airport is designed so that the common path is fast and cheap, and the complex path is thorough and flexible.
The mail room analogy
Intent classification works like a corporate mail room sorting incoming correspondence.
- Every piece of mail arrives at the central mail room (system entry point)
- Machine-readable mail (tier 1): envelopes with department codes or standard forms are sorted automatically by machine. Fast, zero ambiguity, handles 60% of volume
- Handwritten addresses (tier 2): a sorting clerk reads the address and matches it to the right department. Handles variation in handwriting but occasionally misreads
- Unclear or damaged mail (tier 3): a senior clerk opens the envelope, reads the content, and determines where it should go. Slow but handles anything
- Truly unidentifiable mail (fallback): returned to sender with a note asking for clarification
The mail room is efficient because it uses the cheapest method that works for each piece of mail, not the most expensive method for all mail. Intent classification follows the same economic logic.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| prompt-routing | The broader routing framework that uses intent classification as its engine | complete |
| context-cascading | How classification results determine which context layers are loaded | complete |
| autonomy-spectrum | How classification confidence affects whether the system acts autonomously or asks for human input | complete |
| guardrails | Constraints that prevent misclassification from causing harmful actions | complete |
| embeddings | The vector representations that power semantic classification | complete |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain what intent classification does and why it is a separate step from the actual task execution. What problem does separating classification from generation solve?
- Name the three tiers of intent classification and describe one advantage and one limitation of each.
- Distinguish between semantic classification (embeddings) and LLM-based classification. In what situations would you prefer one over the other?
- Interpret this scenario: a customer writes “I was charged $50 but my account says I cancelled last week.” This input matches both the “billing” and “cancellation” intents. How should a well-designed classification system handle this?
- Connect intent classification to confidence thresholds. Why is a binary “classified / not classified” result insufficient, and how do confidence scores enable the multi-tier stacking pattern?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    PR --> IC[Intent Classification]
    PR --> PBP[Playbooks as Programs]
    PR --> ORCH[Orchestration]
    CC[Context Cascading] -.->|prerequisite| PR
    IC -.->|related| AS[Autonomy Spectrum]
    IC -.->|related| GR[Guardrails]
    style IC fill:#4a9ede,color:#fff
```

Related concepts:
- autonomy-spectrum — classification confidence determines where on the autonomy spectrum the system operates: high confidence enables autonomous action, low confidence triggers human involvement
- guardrails — guardrails constrain what the system can do after classification; intent classification constrains where the input goes
- context-cascading — once intent is classified, the routing decision determines which context layers are loaded for the downstream handler
Sources
Further reading
Resources
- AI Agent Routing: A Practical Guide to Intent Classification (BSWEN) — Hands-on implementation guide with Python code for rule-based, semantic, and LLM-based classification
- LLM Routing for AI Agents: The Complete Architecture Guide (ClawRouters) — Deep technical dive into multi-tier routing architecture with latency, cost, and accuracy trade-offs
- Building Effective Agents (Anthropic) — Foundational reference on agent workflow patterns including routing and classification
- AI Agent Routing: Everything You Need to Know (Zencoder) — Comprehensive overview of routing approaches, from rule-based to hierarchical multi-agent systems
- Level 2 Agents: Router Pattern Deep Dive (AI Skill Market) — Progressive walkthrough of the router pattern with increasing complexity levels
Footnotes
1. BSWEN. (2026). AI Agent Routing: A Practical Guide to Intent Classification and Routing Implementation. BSWEN.
2. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.
3. Google Cloud. (2025). Generative Fallback — Dialogflow CX. Google Cloud.
4. ClawRouters Team. (2026). LLM Routing for AI Agents: The Complete Architecture Guide. ClawRouters.
5. Zencoder. (2026). AI Agent Routing: Everything That You Need to Know. Zencoder.
