Evaluator-Optimiser
A pipeline pattern where one LLM generates output, another evaluates it against quality criteria, and the generator revises until the output passes or an iteration limit is reached.
What is it?
prompt-chaining processes tasks in a straight line: step 1 feeds step 2 feeds step 3, moving forward without looking back. This works when the first attempt at each step is good enough. But for quality-sensitive tasks — literary translation, legal drafting, code generation with test suites — the first attempt is rarely good enough. You need a way to improve the output through iteration, not just check it with a pass/fail gate.1
The evaluator-optimiser pattern adds this iterative refinement. Instead of a single forward pass, it creates a loop: one LLM call generates output, a second LLM call evaluates that output against explicit criteria and provides detailed feedback, and the generator uses that feedback to produce an improved version. The cycle repeats until the evaluator is satisfied or a maximum number of iterations is reached.1
Anthropic’s Building Effective Agents guide identifies this as a distinct workflow pattern, describing it as “particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value.” The two signs of good fit are: first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback autonomously.1
The parent concept, llm-pipelines, frames the evaluator-optimiser as one of several pipeline patterns. Where prompt-chaining is the simplest pattern (a straight line) and parallelisation handles independent subtasks concurrently, the evaluator-optimiser handles tasks where quality emerges through iterative refinement rather than from a single pass.
In plain terms
Think of writing an essay with a tutor looking over your shoulder. You write a draft, the tutor reads it, marks up what needs improvement (“this paragraph is unclear,” “this claim needs evidence”), and hands it back. You revise and resubmit. The tutor checks again. This continues until the tutor says the essay is ready — or until you run out of revision rounds.
At a glance
The generate-evaluate-refine loop
```mermaid
graph TD
    A[Input Task] --> B[Generator LLM]
    B --> C[Output Draft]
    C --> D[Evaluator LLM]
    D --> E{Passes criteria?}
    E -->|Yes| F[Final Output]
    E -->|No| G{Max iterations?}
    G -->|No| H[Feedback]
    H --> B
    G -->|Yes| I[Best attempt + flag]
```

Key: The generator produces a draft. The evaluator scores it against criteria and provides specific feedback. If the draft passes, it becomes the final output. If not, the feedback loops back to the generator for revision. A maximum iteration count acts as a circuit breaker to prevent infinite loops. If the limit is reached, the best attempt so far is returned with a flag indicating it did not fully pass evaluation.
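The control flow in the diagram can be sketched in a few lines of Python. `call_generator` and `call_evaluator` are hypothetical stand-ins for real LLM calls; here they are stubbed so the loop structure itself is visible.

```python
def call_generator(task, feedback=None):
    # Stand-in for an LLM call. A real implementation would send the
    # task (plus any evaluator feedback) to a model and return a draft.
    return f"draft for {task!r}" + (" [revised]" if feedback else "")

def call_evaluator(draft):
    # Stand-in for a second LLM call. Returns (passed, feedback).
    # This stub accepts only revised drafts, forcing one refinement round.
    if "[revised]" in draft:
        return True, "All criteria met."
    return False, "Needs revision: tighten the summary."

def refine(task, max_iterations=3):
    feedback = None
    for i in range(1, max_iterations + 1):
        draft = call_generator(task, feedback)
        passed, feedback = call_evaluator(draft)
        if passed:
            return {"output": draft, "passed": True, "iterations": i}
    # Circuit breaker tripped: return the latest attempt, flagged.
    return {"output": draft, "passed": False, "iterations": max_iterations}

result = refine("summarise the quarterly report")
print(result["passed"], result["iterations"])
```

With the stubs above, the first draft fails and the revised second draft passes, exiting the loop after two iterations.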
How does it work?
1. The generator — producing the initial output
The generator is an LLM call whose job is to produce output for a given task. On the first iteration, it works from the original input. On subsequent iterations, it receives the original input plus the evaluator’s feedback from the previous round.2
The generator’s prompt should be focused on the task itself, not on self-evaluation. Trying to generate and evaluate simultaneously splits the model’s attention and degrades both capabilities. Separating the two roles into distinct LLM calls is the core design insight of this pattern.2
Think of it like...
A writer drafting a chapter. The writer’s job is to write the best version they can, focusing entirely on the content. They do not simultaneously try to be their own editor — that is someone else’s job. On revision rounds, the writer receives the editor’s notes and addresses each one specifically.
2. The evaluator — scoring and providing feedback
The evaluator is a separate LLM call that receives the generator’s output and assesses it against predefined criteria. The evaluator’s output is not just a pass/fail decision — it provides specific, actionable feedback that the generator can use to improve.1
The quality of the evaluator’s criteria directly determines the quality of the refinement loop. There are three common approaches to defining evaluation criteria:2
| Criteria type | When to use | Example |
|---|---|---|
| Rubric | Multi-dimensional assessment with granular scores | Score clarity (1-5), accuracy (1-5), completeness (1-5); pass if all scores are 4 or above |
| Checklist | When specific requirements must all be met | Does the output include a summary? Are all claims cited? Is the word count within range? |
| Binary pass/fail | Simple quality gates | Does the code pass all unit tests? Does the translation preserve the original meaning? |
The evaluator should output both a decision (pass or fail) and detailed feedback explaining what needs to change. Vague feedback (“make it better”) produces vague revisions. Specific feedback (“paragraph 3 makes an unsupported claim about market size — add a citation or remove the claim”) produces targeted improvements.3
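One way to make that decision-plus-feedback output machine-usable is to prompt the evaluator to emit a small structured object rather than free text. A minimal sketch, assuming the evaluator is instructed to reply with JSON containing `decision` and `feedback` fields (the field names are illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class Evaluation:
    passed: bool
    feedback: str

def parse_evaluation(raw: str) -> Evaluation:
    # Parse the evaluator LLM's JSON reply into a typed object.
    # A real pipeline would also handle malformed JSON (retry or fail).
    data = json.loads(raw)
    return Evaluation(passed=(data["decision"] == "pass"),
                      feedback=data["feedback"])

# Example reply a real evaluator might be prompted to emit:
raw = ('{"decision": "fail", "feedback": "Paragraph 3 makes an unsupported '
       'claim about market size - add a citation or remove the claim."}')
ev = parse_evaluation(raw)
print(ev.passed, "-", ev.feedback)
```

The typed object lets the loop branch on `passed` while passing `feedback` verbatim back to the generator.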
Example: evaluating a code generation task
Consider a pipeline that generates Python functions from natural language descriptions:
Generator output: A Python function implementing the requested behaviour.
Evaluator criteria (checklist):
- Does the function produce the correct output for all provided test cases?
- Does the function handle edge cases (empty input, null values)?
- Does the function include a docstring?
- Is the time complexity acceptable for the expected input size?
Evaluator feedback (iteration 1): “The function passes 4 of 5 test cases but fails on empty input — it raises an unhandled TypeError. Add a guard clause for empty input. The docstring is missing. Time complexity is acceptable.”
Generator revision: Adds the guard clause and docstring, resubmits.
Evaluator feedback (iteration 2): “All test cases pass. Docstring present. Edge cases handled. Pass.”
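The first checklist item — do all test cases pass? — can be checked deterministically, with no LLM involved. A minimal sketch; the function name `head` and the test cases are illustrative, not from the pipeline above:

```python
def run_test_cases(source: str, func_name: str, cases):
    """Execute generated source and report which test cases fail,
    phrased so the generator can act on the report as feedback."""
    namespace = {}
    exec(source, namespace)  # caution: run untrusted code in a sandbox
    func = namespace[func_name]
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
            if actual != expected:
                failures.append(
                    f"{func_name}{args} returned {actual!r}, expected {expected!r}")
        except Exception as exc:
            failures.append(
                f"{func_name}{args} raised {type(exc).__name__}: {exc}")
    return failures

# A generated draft that forgets the empty-input edge case:
draft = "def head(xs):\n    return xs[0]"
cases = [(([1, 2, 3],), 1), (([],), None)]
failures = run_test_cases(draft, "head", cases)
print(failures)
```

The failure messages play the same role as the evaluator's prose feedback: they name the specific deficiency rather than just reporting "fail".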
3. The feedback loop — iterative refinement
The power of this pattern comes from the feedback loop itself. Each iteration narrows the gap between the current output and the quality target. The generator does not start from scratch on each iteration — it receives its previous output alongside the evaluator’s feedback and makes targeted revisions.2
This is structurally different from a validation gate in a prompt chain. A gate is binary: pass or fail. If the output fails, the step retries from scratch or halts. The evaluator-optimiser, by contrast, provides directional feedback — it tells the generator what is wrong and how to fix it, enabling incremental improvement rather than blind retry.1
Research on iterative refinement with LLMs shows that this approach reliably improves output quality across diverse tasks — code generation, translation, summarisation, and creative writing — because each iteration addresses specific deficiencies identified by the evaluator rather than hoping that a fresh attempt will randomly produce a better result.4
Key distinction
A gate asks “is this good enough?” and answers yes or no. An evaluator asks “what specifically is wrong and how should it be fixed?” Gates catch problems. Evaluators diagnose and guide the fix. Start with gates (simpler, cheaper). Upgrade to an evaluator-optimiser loop when gate failures are frequent and the fixes require nuanced judgement.1
4. Circuit breakers — preventing infinite loops
Without a maximum iteration count, a strict evaluator could keep rejecting output indefinitely — the generator and evaluator locked in an endless cycle of revision.2 Circuit breakers solve this by imposing a hard cap on iterations.
When the iteration limit is reached without passing evaluation, the system has two options:
- Return the best attempt so far with a quality flag indicating it did not fully pass. This is appropriate when partial quality is better than no output.
- Escalate to human review. The system surfaces the latest draft alongside the evaluator’s feedback, letting a human make the final call.3
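Returning the best attempt requires tracking a score per iteration, not just keeping the latest draft. A minimal sketch, assuming the evaluator also returns a numeric quality score (an illustrative addition, not part of the pattern's minimal form):

```python
def refine_with_fallback(drafts_and_scores, threshold=0.9):
    # drafts_and_scores: iterable of (draft, score) pairs, one per
    # iteration, already capped by the iteration limit upstream.
    best_draft, best_score = None, float("-inf")
    for draft, score in drafts_and_scores:
        if score >= threshold:
            return {"output": draft, "passed": True}
        if score > best_score:
            best_draft, best_score = draft, score
    # Circuit breaker tripped: surface the best attempt, flagged so
    # downstream consumers (or a human reviewer) know it did not pass.
    return {"output": best_draft, "passed": False, "score": best_score}

result = refine_with_fallback([("v1", 0.6), ("v2", 0.8), ("v3", 0.75)])
print(result)
```

Note that the best attempt here is the second draft, not the last one: iteration does not guarantee monotonic improvement, which is exactly why the best score must be tracked explicitly.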
In practice, most evaluator-optimiser loops converge within 2-3 iterations. If the loop regularly hits its cap, the evaluation criteria may be too strict, the generator’s prompt may need improvement, or the task may be too difficult for the model to handle without human guidance.2
Think of it like...
A student submitting a university assignment. The professor allows up to 3 resubmissions. Each resubmission includes the professor’s feedback from the previous version. If after 3 attempts the work still does not meet the standard, the student meets with the professor to discuss (human escalation). Without the resubmission cap, the student might endlessly revise a single assignment instead of moving forward.
5. Where human review intersects with automated evaluation
The evaluator-optimiser pattern does not replace human judgement — it automates the routine layers of quality assessment so that humans focus on what requires genuine expertise.3
A common production architecture layers automated and human evaluation:
| Layer | What it checks | How |
|---|---|---|
| Automated evaluator (LLM) | Structural criteria: format, completeness, factual consistency | LLM call with criteria rubric |
| Automated evaluator (code) | Deterministic criteria: test cases pass, word count within range, no prohibited terms | Programmatic checks |
| Human reviewer | Subjective criteria: appropriateness, strategic alignment, nuance | Human-in-the-loop checkpoint |
The automated layers handle the bulk of iterations. The human layer handles the cases that require judgement beyond what the LLM can reliably provide. This human-in-the-loop integration means the evaluator-optimiser loop does not need to be perfect — it needs to be good enough to reduce the human reviewer’s workload to the cases that genuinely need human attention.3
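The layering can be expressed as an ordered list of checks, cheapest first, so the LLM evaluator only runs once the deterministic checks pass. A sketch under that assumption; `llm_rubric_check` is a stand-in for a real LLM call, and the specific checks are illustrative:

```python
def word_count_ok(text):
    # Deterministic layer: word count within an assumed range.
    return 10 <= len(text.split()) <= 200

def no_prohibited_terms(text, banned=("guaranteed returns",)):
    # Deterministic layer: no prohibited phrases present.
    return not any(term in text.lower() for term in banned)

def llm_rubric_check(text):
    # Stand-in for the LLM evaluator layer; a real implementation
    # would send the text plus a rubric to a model.
    return True

def layered_evaluate(text):
    # Cheap deterministic layers run first; any failure short-circuits
    # before LLM tokens are spent on the rubric layer.
    layers = [("word count", word_count_ok),
              ("prohibited terms", no_prohibited_terms),
              ("rubric", llm_rubric_check)]
    for name, check in layers:
        if not check(text):
            return False, f"Failed layer: {name}"
    return True, "All layers passed"

ok, msg = layered_evaluate("This fund offers guaranteed returns " + "word " * 20)
print(ok, msg)
```

Drafts that clear every automated layer are the only ones that reach the human checkpoint, which is what keeps the reviewer's workload bounded.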
Yiuno example: concept card refinement
When this knowledge system creates a concept card, an evaluator-optimiser loop could automate quality review:
Generator: Writes the card following the template and playbook.
Evaluator criteria (checklist):
- Does the card follow the exact section order from the playbook?
- Are all technical terms explained or linked?
- Are all footnote references matched with entries in the Sources section?
- Does the Mermaid diagram use valid syntax?
- Are comprehension questions project-agnostic?
Iteration 1 feedback: “Section order is correct. Two technical terms are used without explanation or link: ‘context window’ in paragraph 3 and ‘temperature’ in the voting section. Footnote 4 is referenced in the body but has no matching entry in Sources.”
Generator revision: Adds inline explanations for both terms and adds the missing source entry.
Iteration 2 feedback: “All checks pass.”
Why do we use it?
Key reasons
1. Quality through iteration. For quality-sensitive tasks, iterative refinement consistently outperforms single-shot generation. The first draft captures the broad strokes; subsequent iterations address specific deficiencies identified by the evaluator.1
2. Explicit quality criteria. The pattern forces you to define what “good enough” means before generation starts. This clarity benefits the entire pipeline — the generator knows what to aim for, the evaluator knows what to check, and the human reviewer knows what has already been verified.2
3. Separation of concerns. Generation and evaluation are fundamentally different cognitive tasks. Separating them into distinct LLM calls allows each to focus fully on its role; an LLM asked to generate and self-evaluate in a single call compromises both.2
4. Auditability. Every iteration produces an inspectable artifact: the draft, the evaluation, and the feedback. If the final output has a problem, you can trace back through the iteration history to understand how it was (or was not) caught.3
When do we use it?
- When quality criteria are clear and measurable — rubrics, checklists, test suites, style guides
- When a first draft is reliably improvable with specific feedback (the LLM can act on critique)
- When human feedback demonstrably improves output — if a human can articulate what is wrong and the result improves, an LLM evaluator can likely do the same1
- When the task involves creative or nuanced generation — translation, writing, code — where the gap between first draft and polished output is significant
- When compliance or accuracy standards require verification before output is used
Rule of thumb
If you find yourself repeatedly asking the model to “try again” or “make it better” and the output actually improves each time, you have identified a task that would benefit from an evaluator-optimiser loop. Automate the critique and the iteration.1
How can I think about it?
The editor and the writer
A publishing house does not ship the writer’s first draft. The process works like this:
- The writer produces a manuscript (generator)
- The editor reads it and marks it up with specific feedback: “chapter 3 drags, tighten the pacing,” “the protagonist’s motivation is unclear in scene 7,” “this dialogue feels stilted” (evaluator)
- The writer revises based on the editor’s notes and resubmits (refinement loop)
- The cycle repeats — typically 2-3 rounds — until the editor approves (pass criteria)
- If the manuscript still is not ready after the agreed number of rounds, the editor and writer meet to discuss the path forward (circuit breaker / human escalation)
The writer focuses on writing. The editor focuses on quality. Neither tries to do the other’s job. The manuscript improves with each round because the feedback is specific and actionable.
The pottery wheel
A potter shaping a bowl on a wheel works in iterative refinement cycles:
- Shape the clay with your hands (generate)
- Step back and look at the form — is it symmetrical? Is the wall thickness even? Is the rim smooth? (evaluate)
- Adjust based on what you see — thin out a thick spot, smooth a wobble, widen the opening (refine)
- Repeat until the form is right or the clay has been worked too long and needs to rest (circuit breaker)
The potter does not try to create a perfect bowl in one motion. The quality emerges from the cycle of shaping, assessing, and adjusting. Each cycle addresses specific imperfections. And there is a natural limit — overworking the clay makes it worse, just as too many LLM iterations can degrade rather than improve output.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| parallelisation | Running independent subtasks concurrently instead of sequentially | complete |
| human-in-the-loop | Integrating human judgement at checkpoints within automated pipelines | stub |
| guardrails | Safety constraints that bound what an LLM can produce or act on | stub |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself
- Explain why separating generation and evaluation into distinct LLM calls produces better results than asking a single LLM to generate and self-evaluate in one prompt.
- Name the three types of evaluation criteria (rubric, checklist, binary pass/fail) and describe when each is most appropriate.
- Distinguish between a validation gate in a prompt chain and an evaluator in an evaluator-optimiser loop. What does the evaluator provide that a gate does not?
- Interpret this scenario: an evaluator-optimiser loop for code generation hits its iteration cap of 3 on 40% of inputs. What does this suggest, and what would you investigate first?
- Connect the evaluator-optimiser pattern to the concept of human-in-the-loop review. How do automated and human evaluation layers complement each other in a production system?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    AS[Agentic Systems] --> LP[LLM Pipelines]
    LP --> PC[Prompt Chaining]
    LP --> PR[Prompt Routing]
    LP --> PAR[Parallelisation]
    LP --> EO[Evaluator-Optimiser]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    style EO fill:#4a9ede,color:#fff
```

Related concepts:
- parallelisation — where the evaluator-optimiser iterates for quality, parallelisation runs tasks concurrently for speed; the two patterns are complementary
- human-in-the-loop — human review is the escalation path when automated evaluation reaches its limits or when subjective judgement is required
- guardrails — guardrails constrain what the generator can produce; the evaluator checks whether those constraints (and others) are met
Sources
Further reading
Resources
- Building Effective Agents (Anthropic) — The foundational reference on workflow patterns, with a clear definition of when the evaluator-optimizer pattern is the right fit
- Evaluator-Optimizer Loop: Continuous AI Agent Improvement (Hopx.ai) — Detailed walkthrough with code examples showing how to implement the generate-evaluate-refine loop in practice
- Design Patterns for Building Agentic Workflows (Hugging Face) — Broader catalogue of six design patterns showing where the evaluator-optimiser fits alongside chaining, routing, and parallelisation
- The Evaluator-Optimizer Loop: Evolutionary Search (Understanding Data) — Advanced perspective framing the evaluator-optimiser as an evolutionary search pattern applicable beyond LLMs
- Evaluator-Optimizer Pattern (Java AI Dev) — Concise implementation reference with Spring AI examples and clear architectural diagrams
Footnotes
1. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.
2. Hopx.ai. (2025). Evaluator-Optimizer Loop: Continuous AI Agent Improvement. Hopx.ai.
3. Carpintero, D. (2025). Design Patterns for Building Agentic Workflows. Hugging Face.
4. Phoenix, J. (2026). The Evaluator-Optimizer Loop: Evolutionary Search for Anything You Can Judge. Understanding Data.
