Benchmarks

Merlin includes a built-in benchmark system that evaluates provider quality, latency, tool use, refusal behavior, and long-session reliability across 27 suites with 178 tests. Results accumulate over time, giving you historical data to make informed provider choices.

Looking for current numbers? The live benchmarks dashboard is continuously updated; the scorecard below is a point-in-time snapshot for reference.

Running Benchmarks

merlin bench                         # run all suites, all providers
merlin bench --suite reasoning       # specific suite
merlin bench --provider openai       # specific provider
merlin bench --history               # accumulated historical stats

Results are additive: each run is written to .fledge/benchmarks/ as a timestamped JSON file and becomes part of the accumulated history.
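If you want to summarize the accumulated files yourself rather than through `merlin bench --history`, something like the following could work. This is a minimal sketch only: the JSON schema is not documented in this section, so the one-object-per-file layout and the field names `provider`, `passed`, and `total` are assumptions made for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_history(history_dir):
    """Sum passed/total counts per provider across every JSON file in a directory.

    Assumes (hypothetically) that each file holds one object with
    "provider", "passed", and "total" keys; the real schema may differ.
    """
    totals = defaultdict(lambda: {"passed": 0, "total": 0, "runs": 0})
    for path in sorted(Path(history_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        entry = totals[record["provider"]]
        entry["passed"] += record["passed"]
        entry["total"] += record["total"]
        entry["runs"] += 1
    return dict(totals)
```

Under those assumptions, `summarize_history(".fledge/benchmarks")` would return one running total per provider, which is the additive behavior described above.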

Release Gate

For release readiness, Merlin ships an executable 1.0 Performance Gate. The gate checks provider p95 latency, adversarial bundle cost, tool-call budget, verify-lane median, streaming TTFT, required-provider coverage, and explicit release-manager waivers.
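To make one of these checks concrete, here is a rough sketch of a p95 latency check with waivers. It is not Merlin's gate implementation: the function names, the nearest-rank percentile method, the `limit_seconds` threshold, and the shape of the `waived` input are all assumptions chosen for illustration.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # 1-based nearest rank, computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_latency_gate(latencies_by_provider, limit_seconds, waived=()):
    """Return {provider: "pass" | "fail" | "waived"} for a p95 latency limit.

    `limit_seconds` and `waived` are illustrative inputs, not Merlin settings.
    """
    results = {}
    for provider, samples in latencies_by_provider.items():
        if p95(samples) <= limit_seconds:
            results[provider] = "pass"
        elif provider in waived:
            results[provider] = "waived"
        else:
            results[provider] = "fail"
    return results
```

In this sketch a waiver only applies to a provider that would otherwise fail, so a waived result stays visible in the output instead of being reported as a pass.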

Provider Scorecard

Latest public Ollama Cloud sweep: 333 publishable suite runs across 26 suites and 20 providers. The newest suite, merlin_agentic, landed after this sweep, so its scores so far live in the shortlist breakdown below rather than this 20-provider table. Kimi 2.6 is intentionally omitted from the public leaderboard until its benchmark behavior is reliable enough to compare fairly.

Provider	Model	Suites	Passed	Avg Pass	Total Model Time
`ollama-deepseek-v4-flash`	DeepSeek V4 Flash	26	143/168	88%	41.2m
`ollama`	Qwen3 Coder 480B	26	143/168	88%	23.8m
`ollama-gemma4`	Gemma 4 31B	26	140/168	87%	31.2m
`ollama-kimi`	Kimi K2.5	26	139/168	86%	96.6m
`ollama-qwen-coder`	Qwen3 Coder 480B	26	140/168	85%	79.6m
`ollama-nemotron-3-super`	Nemotron 3 Super	26	137/168	85%	40.9m
`ollama-glm`	GLM 4.7	26	133/168	83%	66.6m
`ollama-qwen-coder-next`	Qwen3 Coder Next	26	131/168	82%	46.5m
`ollama-glm-5`	GLM 5	25	126/163	82%	89.2m
`ollama-devstral`	Devstral 2 123B	26	128/168	79%	32.4m
`ollama-gpt-oss`	GPT OSS 120B	26	128/168	79%	52.5m
`ollama-qwen35`	Qwen 3.5 397B	23	103/152	70%	187.8m

These results reflect raw model capability on structured tasks, not end-to-end agent performance. Use the per-suite breakdown and the release gate together: quality scores say whether a provider can solve the work; SLOs say whether it is fast and predictable enough to ship.
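Note that the Avg Pass column does not equal Passed divided by the test total: 143/168 is about 85%, while the scorecard lists 88% for those rows. One reading consistent with that, and it is an assumption rather than something this page confirms, is that Avg Pass is the unweighted mean of per-suite pass rates, so a 3-test suite counts as much as a 10-test suite. The sketch below contrasts the two aggregations on made-up numbers.

```python
def pooled_pass_rate(suites):
    """Total passed over total tests across all suites."""
    passed = sum(p for p, _ in suites)
    total = sum(t for _, t in suites)
    return passed / total


def mean_suite_pass_rate(suites):
    """Unweighted mean of each suite's own pass rate."""
    return sum(p / t for p, t in suites) / len(suites)


# Illustrative (passed, total) pairs, not real benchmark results.
example = [(3, 3), (5, 5), (4, 10)]
print(round(pooled_pass_rate(example), 3))      # 12/18
print(round(mean_suite_pass_rate(example), 3))  # (1 + 1 + 0.4) / 3
```

With these numbers the per-suite mean comes out higher than the pooled rate because the two small suites are perfect and the large one is weak, which is the same direction as the gap in the scorecard.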

Per-Suite Breakdown

Each suite tests a different capability. Here's how the current shortlist performs across the most discriminating suites:

Suite	`ollama`	DeepSeek V4 Flash	Gemma 4	Nemotron 3 Super	Qwen Coder Next	GLM 5
`advanced_reasoning`	6/8	8/8	6/8	8/8	5/8	8/8
`agent_tasks`	7/7	7/7	7/7	6/7	7/7	7/7
`architecture`	6/6	5/6	5/6	5/6	5/6	5/6
`code_analysis`	7/7	7/7	7/7	6/7	7/7	6/7
`expert`	7/8	7/8	6/8	6/8	5/8	6/8
`hard_mode`	6/10	7/10	7/10	8/10	8/10	6/10
`merlin_agentic`	7/10	8/10	8/10	8/10	9/10	9/10
`nightmare_mode`	8/10	5/10	7/10	3/10	9/10	6/10
`refusal`	4/6	5/6	6/6	4/6	4/6	6/6
`stress_test`	8/8	8/8	5/8	7/8	4/8	7/8
`tool_usage`	5/5	5/5	5/5	5/5	5/5	5/5

merlin_agentic is the newest and hardest suite. It scores the agent loop driving real plugins rather than text in/out, and a test only passes when the model actually invokes the tool and writes the correct file. Across the full Ollama lineup the standout was gpt-oss:120b (10/10), with kimi-k2.7-code and qwen3-coder-next at 9/10. For an Ollama-only setup those three make a good trio of distinct model families, so a multi-model council gets genuinely different reasoning rather than one model in three seats. See the live shootout for the full board.
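The difference between artifact-based scoring and text matching can be sketched as follows. The check names `tool_used` and `file_contains` come from the suite table; everything else here is hypothetical, including the idea that tool calls are available as a list of names and that the agent writes into a workspace directory.

```python
from pathlib import Path


def tool_used(tool_calls, name):
    """True if the agent's recorded tool calls include `name`."""
    return name in tool_calls


def file_contains(workspace, relative_path, needle):
    """True if the file exists under `workspace` and contains `needle`."""
    target = Path(workspace) / relative_path
    return target.is_file() and needle in target.read_text(encoding="utf-8")


def agentic_test_passes(tool_calls, workspace, required_tool, relative_path, needle):
    """Pass only when the tool was invoked AND the artifact is correct."""
    return tool_used(tool_calls, required_tool) and file_contains(
        workspace, relative_path, needle
    )
```

Under this scheme a model that prints the right answer without calling the tool fails, as does one that calls the tool but writes the wrong file contents.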

The per-suite breakdown reveals patterns invisible in aggregate scores. A provider scoring 95% overall might be failing every tool-usage test while acing nearly everything else, which matters if your workflow is tool-heavy.
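As a small worked example, the snippet below takes the Nemotron 3 Super column from the shortlist table and lists the suites that fall under a pass-rate threshold. The 0.7 threshold and the `weak_suites` helper are arbitrary choices for illustration, not part of Merlin.

```python
# Nemotron 3 Super column from the shortlist table: suite -> (passed, total).
nemotron = {
    "advanced_reasoning": (8, 8),
    "agent_tasks": (6, 7),
    "architecture": (5, 6),
    "code_analysis": (6, 7),
    "expert": (6, 8),
    "hard_mode": (8, 10),
    "merlin_agentic": (8, 10),
    "nightmare_mode": (3, 10),
    "refusal": (4, 6),
    "stress_test": (7, 8),
    "tool_usage": (5, 5),
}


def weak_suites(results, threshold):
    """Suites whose pass rate is below `threshold`, worst first."""
    rates = {suite: p / t for suite, (p, t) in results.items()}
    return sorted((s for s, r in rates.items() if r < threshold), key=rates.get)


print(weak_suites(nemotron, 0.7))  # ['nightmare_mode', 'refusal']
```

Pooled across these eleven suites the column is 66/85, roughly 78%, which gives no hint that nightmare_mode sits at 3/10.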

Test Suites

Suite	Tests	What it measures	Details
`advanced_reasoning`	8	Hard logic, constraints, mathematical reasoning	Multi-step chains and constraint satisfaction
`agent_tasks`	7	Agent workflows	Planning, recovery, ambiguity handling
`architecture`	6	Staff-level system design	Distributed systems and trade-off analysis
`basic`	3	Instruction following, format compliance	Greeting, math, list formatting
`claude_code_comparison`	7	Coding-agent comparison tasks	Code analysis, refactoring, spec compliance
`code_analysis`	7	Code review and comprehension	Security, concurrency, performance, multi-function reasoning
`coding`	5	Code generation, bug detection	Palindromes, ownership, regex, debugging
`communication`	5	Clarity, conciseness	Summarization, rewriting, persona
`context_management`	7	Agent context discipline	Recall, synthesis, efficient tool use
`design`	5	Architecture, trade-offs	WebSocket vs polling, caching strategies
`domain`	5	Domain knowledge	TOML, fledge commands, spec format
`engineering`	5	Realistic coding-agent work	Bug finding, refactoring, debugging, code review
`expert`	8	Expert-level reasoning	Proofs, translation, algorithms, dependency analysis
`hard_mode`	10	Adversarial frontier-model weak spots	False positives, precision, type theory
`hard_mode_augmented`	10	Hard mode with focused tools	Measures how scoped fledge plugins close gaps
`long_session`	4	Long-running session fidelity	Drift, rereads, decision amnesia, contradiction handling
`merlin_agentic`	10	The agent loop driving real fledge plugins	Asserts the model actually invoked the tool and wrote the correct artifact (`tool_used` / `file_contains`) - not answer-from-memory. Tuned harder than `hard_mode`.
`multi_turn`	5	Multi-turn context, recall	State tracking, iterative refinement
`nightmare_mode`	10	Tests designed to break frontier models	Cascading logic, traps, confident wrong-answer pressure
`nightmare_mode_augmented`	10	Nightmare mode with focused tools	Verification with calc, Python, typecheck, and related plugins
`oneshot_games`	5	Complete one-shot CLI games	Static checks plus executed Python behavior checks
`reasoning`	5	Logic, math, deduction	Sequences, syllogisms, word problems
`refusal`	6	Prompt-level safety behavior	Destructive requests, prompt injection, forged authority
`reliability`	7	Production-readiness traits	Format compliance, structured output, self-verification
`roleplaying`	5	Persona consistency	Mentor, auditor, skeptic, and role-specific framing
`stress_test`	8	Adversarial precision	Instruction resistance, arithmetic, tabular reasoning
`tool_usage`	5	Structured output, JSON	JSON objects, tool call format, CSV

Writing Custom Suites

Suites are TOML files in benchmarks/suites/. Each test specifies a prompt, expected behavior, and validation checks:

[[test]]
name = "fibonacci"
prompt = "Write a Rust function that returns the nth Fibonacci number"
system = "Write only code. No explanation."
max_tokens = 256
temperature = 0.0

[[test.checks]]
type = "contains"
value = "fn "

[[test.checks]]
type = "contains"
value = "fibonacci"

[[test.checks]]
type = "not_empty"

Run your custom suite:

merlin bench --suite my_custom_suite
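To reason about what a suite will accept before running it, it can help to evaluate the checks by hand. The sketch below handles only the two check types shown in the example above and represents them as plain dictionaries, the way they would look after TOML parsing. Treating `contains` as a case-sensitive substring match and requiring every check to pass are assumptions, not documented Merlin behavior.

```python
def run_check(check, response):
    """Evaluate one check dict against a model response string."""
    kind = check["type"]
    if kind == "contains":
        return check["value"] in response
    if kind == "not_empty":
        return bool(response.strip())
    raise ValueError(f"unsupported check type: {kind}")


def run_checks(checks, response):
    """Assumed rule: a test passes only if every check passes."""
    return all(run_check(c, response) for c in checks)


# The three checks from the fibonacci test, as parsed dictionaries.
fibonacci_checks = [
    {"type": "contains", "value": "fn "},
    {"type": "contains", "value": "fibonacci"},
    {"type": "not_empty"},
]
```

Under these assumptions a response such as `fn fibonacci(n: u64) -> u64 { ... }` satisfies all three checks, while a Python answer using `def fib(n)` fails both `contains` checks.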

The release gate consumes the same raw benchmark JSON. See 1.0 Performance Gate for the exact SLO behavior.