Intent Classification

The mechanism by which a system determines what type of request it is handling — inspecting user input and assigning it to a category before any generation or action begins.


What is it?

Every system that handles more than one type of task must answer a question before it can do anything useful: what does the user actually want? Intent classification is the process of answering that question. It takes raw user input — a message, a query, a command — and assigns it to a predefined category (an “intent”) that determines what happens next.1

The parent concept, prompt-routing, introduces the idea that routing is a separate step from generation: the router decides who answers the question, not what the answer is. Intent classification is the engine inside that router. It is the specific mechanism that inspects the input, interprets what the user is asking for, and produces a label that the routing table uses to select the right handler.2

Intent classification is not a new idea. It predates large language models entirely. Traditional chatbot frameworks like Google’s Dialogflow and IBM Watson have used intent classification since the mid-2010s, matching user utterances to predefined intents using a combination of keyword rules, machine learning models, and training examples.3 What has changed is the sophistication of the classifiers available: modern systems can use LLMs themselves as classifiers, handling ambiguous, multilingual, and context-dependent input that would defeat older approaches.

The critical design decision in intent classification is not whether to classify — it is how. Three distinct tiers of classification exist, each with different trade-offs in speed, cost, flexibility, and reliability. Most production systems stack multiple tiers, using fast, cheap methods first and escalating to expensive, flexible methods only when needed.4

In plain terms

Intent classification is like a triage nurse in a hospital emergency room. The nurse does not treat anyone — they listen to each patient’s description of what is wrong, decide which category it falls into (cardiac, trauma, minor injury, psychiatric), and send the patient to the right department. The quality of the triage determines whether patients get the right treatment quickly.


How does it work?

Intent classification operates through three tiers of increasing sophistication. Understanding each tier — and when to use it — is the core skill in designing a classification system.

1. Rule-based classification — keywords, regex, exact matches

The simplest tier uses deterministic rules: if the input contains “cancel” and “subscription”, classify as cancellation. If it starts with “/help”, classify as help_request. Regular expressions, keyword lists, and pattern matching are the tools here.1

Rule-based classification is fast (microseconds), free (no API calls), predictable (the same input always produces the same output), and easy to debug (you can read the rules and understand every decision). These properties make it the ideal first pass in any classification system.

The limitation is rigidity. Rules fail on paraphrases (“I want to stop paying” vs “cancel my subscription”), multilingual input, typos, and any phrasing the rule author did not anticipate. As the number of intents grows, the rule set becomes brittle and increasingly difficult to maintain without conflicts.5

For example:

def rule_classify(message):
    """Tier 1: deterministic keyword rules. Returns None when no rule
    matches so the caller can escalate to the next tier."""
    msg = message.lower()
    if any(kw in msg for kw in ["cancel", "unsubscribe", "stop"]):
        return "cancellation"
    if any(kw in msg for kw in ["invoice", "bill", "payment", "charge"]):
        return "billing"
    if any(kw in msg for kw in ["bug", "error", "crash", "broken"]):
        return "technical"
    return None  # no rule matched — escalate to next tier

Think of it like...

A vending machine with labelled buttons. Press B3 and you always get the same snack. Fast, reliable, zero ambiguity — but if what you want is not on a button, the machine cannot help you.

2. Semantic classification — embedding similarity

The second tier uses vector embeddings to compare the user’s input against descriptions of each intent. Both the input and the intent descriptions are converted into numerical vectors (embeddings), and the system selects the intent whose description is most similar to the input in vector space.4

This approach handles paraphrases naturally. “I want to stop paying” and “cancel my subscription” produce similar embeddings even though they share no keywords. It also handles multilingual input if the embedding model supports multiple languages.

The process works as follows:

  1. Define each intent with a description: cancellation = "User wants to cancel, end, or stop their subscription or service"
  2. Pre-compute an embedding vector for each intent description
  3. When input arrives, compute its embedding
  4. Compare the input embedding against all intent embeddings using cosine similarity
  5. Select the intent with the highest similarity score — if it exceeds a confidence threshold

The confidence threshold is critical. If the best match scores 0.92 out of 1.0, the classification is confident. If it scores 0.61, the input is ambiguous and should be escalated to the next tier rather than routed to a potentially wrong handler.5
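The five steps above can be sketched in Python. Everything here is illustrative: the toy `embed` function is a bag-of-words stand-in for a real embedding model (a production system would call a sentence-transformer or an embeddings API instead), so the sketch demonstrates only the pipeline mechanics — pre-computing intent vectors, scoring with cosine similarity, applying a threshold — not genuine semantic matching.

```python
from math import sqrt

# Toy "embedding": word counts over a fixed vocabulary. A real system
# would replace this with a call to an embedding model.
VOCAB = ["cancel", "stop", "end", "pay", "bill", "invoice", "charge",
         "bug", "error", "crash", "subscription", "service"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: define each intent and pre-compute its embedding once.
INTENTS = {
    "cancellation": "cancel end stop subscription service",
    "billing": "bill invoice charge pay",
    "technical": "bug error crash",
}
INTENT_VECTORS = {name: embed(desc) for name, desc in INTENTS.items()}

def semantic_classify(message, threshold=0.3):
    # Steps 3-5: embed the input, score it against every intent,
    # and accept the best match only if it clears the threshold.
    vec = embed(message)
    scores = {name: cosine(vec, ivec) for name, ivec in INTENT_VECTORS.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return None, scores[best]  # ambiguous — escalate to the next tier
```

The threshold value (0.3 here) is an assumption for the toy vocabulary; in practice it is tuned against labelled examples.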

Think of it like...

A librarian who understands what you mean, not just what you say. You ask for “a book about growing tomatoes” and the librarian finds it in the gardening section — even though no book is titled exactly that. The librarian matches meaning, not keywords.

3. LLM-based classification — the model as classifier

The most flexible tier uses a language model to classify the input. The classifier prompt describes the available intents and asks the model to select the best match, typically returning a structured label rather than free text.2

For example:

System: You are an intent classifier. Classify the user
message into exactly one category. Return ONLY the
category name.

Categories:
- billing: questions about charges, invoices, payments
- technical: bug reports, errors, system issues
- account: profile settings, passwords, preferences
- cancellation: stopping or ending a subscription
- general: everything else

User: "Je voudrais annuler mon abonnement"
Response: cancellation

The LLM-based classifier handles the French input (“I would like to cancel my subscription”) correctly — something neither keyword rules nor simple embeddings would manage reliably without multilingual support. It also handles ambiguous, multi-intent, and context-dependent input far better than the other tiers.2

The trade-offs are cost (an LLM API call for every classification), latency (hundreds of milliseconds vs microseconds for rules), and non-determinism (the same input might occasionally be classified differently on different runs). Production systems mitigate these by using a small, fast model (e.g., GPT-4o-mini) for classification rather than a large, expensive one.1

Key distinction

Few-shot classification improves LLM-based classifiers significantly. Instead of only describing the categories, you include 2-3 examples of each category in the prompt. The model uses these examples as reference points, producing more consistent classifications — especially for edge cases and ambiguous input.1
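A minimal sketch of the pattern, assuming a `call_llm` function that wraps whatever chat-completion client the deployment uses (injected as a parameter so the logic is testable without network access — the function name and the few-shot examples are illustrative). The validation step is the important part: a returned label outside the taxonomy falls back to the catch-all intent rather than being trusted.

```python
VALID_INTENTS = {"billing", "technical", "account", "cancellation", "general"}

CLASSIFIER_PROMPT = """You are an intent classifier. Classify the user
message into exactly one category. Return ONLY the category name.

Categories:
- billing: questions about charges, invoices, payments
- technical: bug reports, errors, system issues
- account: profile settings, passwords, preferences
- cancellation: stopping or ending a subscription
- general: everything else

Examples:
User: "Why was I charged twice?" -> billing
User: "The app crashes on startup" -> technical

User: "{message}" ->"""

def llm_classify(message, call_llm):
    # call_llm: str -> str, a wrapper around the model client.
    raw = call_llm(CLASSIFIER_PROMPT.format(message=message))
    label = raw.strip().lower()
    # Never trust free text: anything outside the taxonomy
    # routes to the catch-all intent.
    return label if label in VALID_INTENTS else "general"
```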

4. Designing an intent taxonomy

The quality of classification depends heavily on the quality of the intent definitions. A well-designed intent taxonomy has these properties:5

  • Mutually exclusive — each input should clearly belong to one category, not several
  • Collectively exhaustive — every possible input should match at least one category (use a “general” or “other” catch-all)
  • Appropriately granular — too few categories force dissimilar requests into the same handler; too many create overlaps and confusion
  • Stable — categories should not change frequently, though new ones can be added

A common mistake is designing categories based on internal system architecture rather than user intent. Users do not think in terms of your microservices; they think in terms of what they want to accomplish. “Change my address” and “change my password” are different intents even though both touch the “account” service.

5. Confidence thresholds and fallback handling

Every classification system produces uncertain results. The mechanism for handling uncertainty is what separates a frustrating system from an intelligent one.4

Confidence thresholds define the minimum certainty required before acting on a classification. If a semantic classifier returns a top score of 0.55, the system should not route confidently — it should either escalate to a higher tier or ask the user for clarification.

Fallback strategies determine what happens when confidence is low:

  • Clarification prompts — present the user with the top 2-3 candidate intents and let them choose: “Did you mean billing, account, or something else?”
  • Default handler — route to a general-purpose agent that can handle broad queries
  • Escalation — pass to a human operator when automated classification fails
  • Multi-label routing — when an input legitimately spans multiple intents, run multiple handlers in parallel
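One way the clarification and default-handler fallbacks above might be wired together, as a sketch (the threshold value and the policy of offering at most three candidates are illustrative choices, not prescriptions):

```python
def handle_low_confidence(scores, clarify_threshold=0.4):
    """scores: dict mapping intent name -> confidence from a classifier.
    Returns a (strategy, message) pair for the caller to act on."""
    candidates = sorted(scores, key=scores.get, reverse=True)
    plausible = [c for c in candidates if scores[c] >= clarify_threshold]
    if plausible:
        # A few intents are plausible: ask the user to choose.
        options = ", ".join(plausible[:3])
        return ("clarify", f"Did you mean {options}, or something else?")
    # Nothing is plausible: route to the general-purpose handler.
    return ("default", "routing to general-purpose handler")
```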

6. The multi-tier stacking pattern

The most robust production systems combine all three tiers into a cascade, processing input through increasingly capable (and expensive) classifiers until confident classification is achieved. This pattern is widely used in enterprise systems.4

| Tier     | Method             | Speed           | Cost | Handles ambiguity | When to use                                      |
|----------|--------------------|-----------------|------|-------------------|--------------------------------------------------|
| 1        | Rule-based         | Microseconds    | Free | No                | Exact commands, structured input, slash commands |
| 2        | Semantic           | Milliseconds    | Low  | Moderate          | Natural language with clear intent               |
| 3        | LLM                | Hundreds of ms  | High | Yes               | Ambiguous, multilingual, complex input           |
| Fallback | User clarification | Seconds         | None | Yes               | All tiers uncertain                              |

The key insight: most inputs in a well-designed system are handled at tier 1 or 2. The expensive LLM classifier only fires for the 10-20% of inputs that are genuinely ambiguous. This keeps average classification cost low while maintaining high accuracy across the full range of inputs.4
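The cascade can be sketched as a single dispatch function. The three tier classifiers are passed in as plain functions; their signatures here are assumptions matching the descriptions above (rules return an intent or None, the semantic tier returns an (intent, score) pair, the LLM tier returns an intent or None):

```python
def classify(message, rule_classify, semantic_classify, llm_classify):
    # Tier 1: deterministic rules — free and instant.
    intent = rule_classify(message)
    if intent is not None:
        return intent, "rules"
    # Tier 2: embedding similarity — cheap, handles paraphrases.
    intent, _score = semantic_classify(message)
    if intent is not None:
        return intent, "semantic"
    # Tier 3: LLM classifier — expensive, fires only for the
    # residue of genuinely ambiguous inputs.
    intent = llm_classify(message)
    if intent is not None:
        return intent, "llm"
    # Fallback: no tier was confident; ask the user to clarify.
    return None, "clarify"
```

Because each tier is injected, the cascade itself stays trivially testable and the tiers can be swapped independently.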


Why do we use it?

Key reasons

1. Specialisation enables quality. Each handler can be optimised for a narrow task with focused instructions, relevant context, and appropriate tools. Classification is what makes this specialisation possible — without it, every handler must be a generalist.2

2. Cost efficiency at scale. By classifying input before processing it, the system can match model capability to task complexity. Simple queries go to small, fast, cheap models; complex ones go to large, capable, expensive models. Classification is the gatekeeper that makes this economic optimisation work.4

3. Debuggability and accountability. When something goes wrong, the classification decision is an inspectable artifact. You can see what the input was, how it was classified, and which handler processed it. This makes errors traceable and fixable rather than mysterious.1

4. Graceful degradation. A system with confidence thresholds and fallback handling degrades gracefully when it encounters unexpected input. Instead of routing to the wrong handler and producing a bad result, it asks for clarification or routes to a general handler — the user experience remains acceptable even in failure modes.5


When do we use it?

  • When a system handles multiple types of requests that require different handlers
  • When building multi-agent systems where specialised agents handle different domains
  • When cost optimisation matters and different request types justify different model sizes
  • When the system receives input in multiple languages or highly variable phrasing
  • When you need auditability — a record of why each request was handled the way it was
  • When user experience depends on fast, accurate routing to the right capability

Rule of thumb

If your system has more than three distinct types of requests, build an explicit intent classifier rather than relying on a single prompt with conditional logic. The classifier scales; the conditional prompt does not.


How can I think about it?

The airport check-in analogy

Intent classification works like the check-in process at an airport.

  • Passengers arrive with different needs: domestic flights, international flights, first class, special assistance
  • The signage and kiosks (tier 1 - rules) handle the obvious cases: scan your boarding pass, get directed to the right terminal
  • The check-in agent (tier 2 - semantic) handles passengers who do not fit the simple categories: “I have a connecting flight through Zurich” — the agent understands the intent and routes accordingly
  • The supervisor (tier 3 - LLM) handles the truly complex cases: “I missed my flight, I have a visa issue, and my luggage is in another country”
  • If nobody can help, the information desk (fallback) asks clarifying questions and finds the right person

Most passengers (80%+) are handled by signage and kiosks. The expensive human agents only handle the edge cases. The airport is designed so that the common path is fast and cheap, and the complex path is thorough and flexible.

The mail room analogy

Intent classification works like a corporate mail room sorting incoming correspondence.

  • Every piece of mail arrives at the central mail room (system entry point)
  • Machine-readable mail (tier 1): envelopes with department codes or standard forms are sorted automatically by machine. Fast, zero ambiguity, handles 60% of volume
  • Handwritten addresses (tier 2): a sorting clerk reads the address and matches it to the right department. Handles variation in handwriting but occasionally misreads
  • Unclear or damaged mail (tier 3): a senior clerk opens the envelope, reads the content, and determines where it should go. Slow but handles anything
  • Truly unidentifiable mail (fallback): returned to sender with a note asking for clarification

The mail room is efficient because it uses the cheapest method that works for each piece of mail, not the most expensive method for all mail. Intent classification follows the same economic logic.


Concepts to explore next

| Concept           | What it covers                                                                              | Status   |
|-------------------|---------------------------------------------------------------------------------------------|----------|
| prompt-routing    | The broader routing framework that uses intent classification as its engine                  | complete |
| context-cascading | How classification results determine which context layers are loaded                         | complete |
| autonomy-spectrum | How classification confidence affects whether the system acts autonomously or asks for human input | complete |
| guardrails        | Constraints that prevent misclassification from causing harmful actions                      | complete |
| embeddings        | The vector representations that power semantic classification                                | complete |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.


Where this concept fits

Position in the knowledge graph

graph TD
    LP[LLM Pipelines] --> PR[Prompt Routing]
    PR --> IC[Intent Classification]
    PR --> PBP[Playbooks as Programs]
    PR --> ORCH[Orchestration]
    CC[Context Cascading] -.->|prerequisite| PR
    IC -.->|related| AS[Autonomy Spectrum]
    IC -.->|related| GR[Guardrails]
    style IC fill:#4a9ede,color:#fff

Related concepts:

  • autonomy-spectrum — classification confidence determines where on the autonomy spectrum the system operates: high confidence enables autonomous action, low confidence triggers human involvement
  • guardrails — guardrails constrain what the system can do after classification; intent classification constrains where the input goes
  • context-cascading — once intent is classified, the routing decision determines which context layers are loaded for the downstream handler


Footnotes

  1. BSWEN. (2026). AI Agent Routing: A Practical Guide to Intent Classification and Routing Implementation. BSWEN.

  2. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.

  3. Google Cloud. (2025). Generative Fallback — Dialogflow CX. Google Cloud.

  4. ClawRouters Team. (2026). LLM Routing for AI Agents: The Complete Architecture Guide. ClawRouters.

  5. Zencoder. (2026). AI Agent Routing: Everything That You Need to Know. Zencoder.