Benchmarks
Merlin includes a built-in benchmark system that evaluates provider quality, latency, tool use, refusal behavior, and long-session reliability across 26 suites with 168 tests. Results accumulate over time, giving you historical data to make informed provider choices.
Looking for current numbers? The live, continuously updated benchmarks dashboard lives on Merlin's own site: View the live benchmarks dashboard. The scorecard below is a point-in-time snapshot for reference.
Running Benchmarks
merlin bench # run all suites, all providers
merlin bench --suite reasoning # specific suite
merlin bench --provider openai # specific provider
merlin bench --history # accumulated historical stats
Results are additive: each run is written to .fledge/benchmarks/ as a
timestamped JSON file, and the history accumulates across runs.
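Because each run lands as its own timestamped JSON file, the history can be inspected with a short script. The sketch below is hypothetical: it assumes the files end in `.json` and that sorting by filename gives chronological order (true for ISO-style timestamps, unverified for Merlin's naming). It makes no assumption about the fields inside each file.

```python
import json
from pathlib import Path


def load_history(directory):
    """Return (filename, parsed JSON) pairs for every *.json file, sorted by name."""
    runs = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            runs.append((path.name, json.load(handle)))
    return runs


if __name__ == "__main__":
    # Directory name taken from the text above; the file layout is assumed.
    history = load_history(".fledge/benchmarks")
    print(f"{len(history)} benchmark runs on disk")
    if history:
        print("most recent file:", history[-1][0])
```

For the supported view of the same data, `merlin bench --history` remains the primary interface.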
Release Gate
For release readiness, Merlin ships an executable 1.0 Performance Gate. The gate checks provider p95 latency, adversarial bundle cost, tool-call budget, verify-lane median, streaming TTFT, required-provider coverage, and explicit release-manager waivers.
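To make the p95 latency check concrete, here is a minimal sketch of one gate criterion. It is not Merlin's implementation: the function names, the nearest-rank percentile method, the 1000 ms budget, and the sample latencies are all assumptions chosen for illustration.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n); avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_gate(samples_ms, budget_ms, waived=False):
    """Pass when p95 is within budget, or when an explicit waiver is recorded."""
    observed = p95(samples_ms)
    return {"p95_ms": observed, "passed": waived or observed <= budget_ms}


# Made-up latency samples in milliseconds, including one slow outlier.
latencies_ms = [120, 135, 150, 160, 180, 200, 210, 240, 900, 2500]
print(latency_gate(latencies_ms, budget_ms=1000))
```

With only ten samples the nearest-rank p95 is the slowest request, so the single 2500 ms outlier fails this hypothetical gate unless a waiver is supplied. That is the reason tail percentiles, rather than averages, are a common choice for release criteria.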
Provider Scorecard
Latest public Ollama Cloud sweep: 333 publishable suite runs across 26 suites and 20 providers. Kimi 2.6 is intentionally omitted from the public leaderboard until its benchmark behavior is reliable enough to compare fairly.
| Provider | Model | Suites | Passed | Avg Pass | Total Model Time |
|---|---|---|---|---|---|
| ollama-deepseek-v4-flash | DeepSeek V4 Flash | 26 | 143/168 | 88% | 41.2m |
| ollama | Qwen3 Coder 480B | 26 | 143/168 | 88% | 23.8m |
| ollama-gemma4 | Gemma 4 31B | 26 | 140/168 | 87% | 31.2m |
| ollama-kimi | Kimi K2.5 | 26 | 139/168 | 86% | 96.6m |
| ollama-qwen-coder | Qwen3 Coder 480B | 26 | 140/168 | 85% | 79.6m |
| ollama-nemotron-3-super | Nemotron 3 Super | 26 | 137/168 | 85% | 40.9m |
| ollama-glm | GLM 4.7 | 26 | 133/168 | 83% | 66.6m |
| ollama-qwen-coder-next | Qwen3 Coder Next | 26 | 131/168 | 82% | 46.5m |
| ollama-glm-5 | GLM 5 | 25 | 126/163 | 82% | 89.2m |
| ollama-devstral | Devstral 2 123B | 26 | 128/168 | 79% | 32.4m |
| ollama-gpt-oss | GPT OSS 120B | 26 | 128/168 | 79% | 52.5m |
| ollama-qwen35 | Qwen 3.5 397B | 23 | 103/152 | 70% | 187.8m |
These results reflect raw model capability on structured tasks, not end-to-end agent performance. Use the per-suite breakdown and the release gate together: quality scores say whether a provider can solve the work; SLOs say whether it is fast and predictable enough to ship.
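Avg Pass does not always equal Passed divided by the test total: 143/168 is about 85%, while the first two rows report 88%. One possible reading, which is an assumption rather than something this page states, is that Avg Pass is the mean of per-suite pass rates, so a 3-test suite weighs as much as a 10-test suite. The sketch below contrasts the two calculations on illustrative numbers, not on real Merlin output.

```python
def pooled_rate(suites):
    """Total passed tests divided by total tests across all suites."""
    passed = sum(p for p, _ in suites)
    total = sum(t for _, t in suites)
    return passed / total


def mean_suite_rate(suites):
    """Average of each suite's own pass rate (every suite weighs the same)."""
    return sum(p / t for p, t in suites) / len(suites)


# (passed, total) per suite; illustrative values only.
suites = [(3, 3), (5, 5), (5, 5), (4, 10), (5, 10)]
print(f"pooled: {pooled_rate(suites):.0%}")
print(f"mean of suites: {mean_suite_rate(suites):.0%}")
```

Here the pooled rate is 67% and the mean of suite rates is 78%, because the small, fully passed suites pull the unweighted mean upward. Whichever definition applies, comparing providers on the Passed column and on Avg Pass can produce different orderings.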
Per-Suite Breakdown
Each suite tests a different capability. Here's how the current shortlist performs across the most discriminating suites:
| Suite | ollama (Qwen3 Coder 480B) | DeepSeek V4 Flash | Gemma 4 | Nemotron 3 Super | Qwen Coder Next | GLM 5 |
|---|---|---|---|---|---|---|
| advanced_reasoning | 6/8 | 8/8 | 6/8 | 8/8 | 5/8 | 8/8 |
| agent_tasks | 7/7 | 7/7 | 7/7 | 6/7 | 7/7 | 7/7 |
| architecture | 6/6 | 5/6 | 5/6 | 5/6 | 5/6 | 5/6 |
| code_analysis | 7/7 | 7/7 | 7/7 | 6/7 | 7/7 | 6/7 |
| expert | 7/8 | 7/8 | 6/8 | 6/8 | 5/8 | 6/8 |
| hard_mode | 6/10 | 7/10 | 7/10 | 8/10 | 8/10 | 6/10 |
| nightmare_mode | 8/10 | 5/10 | 7/10 | 3/10 | 9/10 | 6/10 |
| refusal | 4/6 | 5/6 | 6/6 | 4/6 | 4/6 | 6/6 |
| stress_test | 8/8 | 8/8 | 5/8 | 7/8 | 4/8 | 7/8 |
| tool_usage | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
The per-suite breakdown reveals patterns invisible in aggregate scores. A provider scoring 95% overall might be failing every tool-usage test while acing everything else, which matters if your workflow is tool-heavy.
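One way to surface such patterns is to flag every suite where a provider falls below a chosen pass rate. The sketch below copies two columns from the table above; the `weak_suites` helper and the 70% threshold are hypothetical choices for illustration, not part of Merlin.

```python
# (passed, total) per suite, copied from the per-suite table above.
BREAKDOWN = {
    "DeepSeek V4 Flash": {
        "advanced_reasoning": (8, 8), "agent_tasks": (7, 7), "architecture": (5, 6),
        "code_analysis": (7, 7), "expert": (7, 8), "hard_mode": (7, 10),
        "nightmare_mode": (5, 10), "refusal": (5, 6), "stress_test": (8, 8),
        "tool_usage": (5, 5),
    },
    "Nemotron 3 Super": {
        "advanced_reasoning": (8, 8), "agent_tasks": (6, 7), "architecture": (5, 6),
        "code_analysis": (6, 7), "expert": (6, 8), "hard_mode": (8, 10),
        "nightmare_mode": (3, 10), "refusal": (4, 6), "stress_test": (7, 8),
        "tool_usage": (5, 5),
    },
}


def weak_suites(scores, threshold=0.7):
    """Names of suites whose pass rate is below threshold, worst first."""
    rates = {name: passed / total for name, (passed, total) in scores.items()}
    return sorted((n for n, r in rates.items() if r < threshold), key=rates.get)


for provider, scores in BREAKDOWN.items():
    print(provider, weak_suites(scores))
```

At this threshold DeepSeek V4 Flash is flagged only for nightmare_mode, while Nemotron 3 Super is flagged for nightmare_mode and refusal, a difference that the aggregate scores of 88% and 85% do not reveal.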
Test Suites
| Suite | Tests | What it measures | Details |
|---|---|---|---|
| advanced_reasoning | 8 | Hard logic, constraints, mathematical reasoning | Multi-step chains and constraint satisfaction |
| agent_tasks | 7 | Agent workflows | Planning, recovery, ambiguity handling |
| architecture | 6 | Staff-level system design | Distributed systems and trade-off analysis |
| basic | 3 | Instruction following, format compliance | Greeting, math, list formatting |
| claude_code_comparison | 7 | Coding-agent comparison tasks | Code analysis, refactoring, spec compliance |
| code_analysis | 7 | Code review and comprehension | Security, concurrency, performance, multi-function reasoning |
| coding | 5 | Code generation, bug detection | Palindromes, ownership, regex, debugging |
| communication | 5 | Clarity, conciseness | Summarization, rewriting, persona |
| context_management | 7 | Agent context discipline | Recall, synthesis, efficient tool use |
| design | 5 | Architecture, trade-offs | WebSocket vs polling, caching strategies |
| domain | 5 | Domain knowledge | TOML, fledge commands, spec format |
| engineering | 5 | Realistic coding-agent work | Bug finding, refactoring, debugging, code review |
| expert | 8 | Expert-level reasoning | Proofs, translation, algorithms, dependency analysis |
| hard_mode | 10 | Adversarial frontier-model weak spots | False positives, precision, type theory |
| hard_mode_augmented | 10 | Hard mode with focused tools | Measures how scoped fledge plugins close gaps |
| long_session | 4 | Long-running session fidelity | Drift, rereads, decision amnesia, contradiction handling |
| multi_turn | 5 | Multi-turn context, recall | State tracking, iterative refinement |
| nightmare_mode | 10 | Tests designed to break frontier models | Cascading logic, traps, confident wrong-answer pressure |
| nightmare_mode_augmented | 10 | Nightmare mode with focused tools | Verification with calc, Python, typecheck, and related plugins |
| oneshot_games | 5 | Complete one-shot CLI games | Static checks plus executed Python behavior checks |
| reasoning | 5 | Logic, math, deduction | Sequences, syllogisms, word problems |
| refusal | 6 | Prompt-level safety behavior | Destructive requests, prompt injection, forged authority |
| reliability | 7 | Production-readiness traits | Format compliance, structured output, self-verification |
| roleplaying | 5 | Persona consistency | Mentor, auditor, skeptic, and role-specific framing |
| stress_test | 8 | Adversarial precision | Instruction resistance, arithmetic, tabular reasoning |
| tool_usage | 5 | Structured output, JSON | JSON objects, tool call format, CSV |
Writing Custom Suites
Suites are TOML files in benchmarks/suites/. Each test specifies a
prompt, expected behavior, and validation checks:
[[test]]
name = "fibonacci"
prompt = "Write a Rust function that returns the nth Fibonacci number"
system = "Write only code. No explanation."
max_tokens = 256
temperature = 0.0
[[test.checks]]
type = "contains"
value = "fn "
[[test.checks]]
type = "contains"
value = "fibonacci"
[[test.checks]]
type = "not_empty"
Run your custom suite:
merlin bench --suite my_custom_suite
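To reason about what a set of checks will accept before running the suite, it can help to model the two check types from the example. The sketch below is not Merlin's validator. It assumes that `contains` is a case-sensitive substring match, that `not_empty` rejects whitespace-only output, and that a test passes only when every check passes. The dictionary mirrors the `[[test]]` table as a TOML parser would load it, with the prompt fields omitted.

```python
def run_check(check, response):
    """Evaluate one check table against a model response string."""
    kind = check["type"]
    if kind == "contains":
        return check["value"] in response
    if kind == "not_empty":
        return bool(response.strip())
    raise ValueError(f"unknown check type: {kind}")


def run_test(test, response):
    """Assumed semantics: a test passes only if all of its checks pass."""
    return all(run_check(check, response) for check in test["checks"])


fibonacci_test = {
    "name": "fibonacci",
    "checks": [
        {"type": "contains", "value": "fn "},
        {"type": "contains", "value": "fibonacci"},
        {"type": "not_empty"},
    ],
}

# Two hand-written responses: Rust as requested, and Python by mistake.
good = "fn fibonacci(n: u64) -> u64 { if n < 2 { n } else { fibonacci(n - 1) + fibonacci(n - 2) } }"
bad = "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"
print(run_test(fibonacci_test, good), run_test(fibonacci_test, bad))
```

Under these assumptions the Rust response passes and the Python response fails. Substring checks are cheap but loose: they confirm that the output looks like a Rust function named `fibonacci`, not that it compiles or returns correct values.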
The release gate consumes the same raw benchmark JSON. See 1.0 Performance Gate for the exact SLO behavior.