Changelog
The full, ongoing changelog lives at CHANGELOG.md in the
repository root (CorvidLabs internal). The high-level story below
is a hand-curated summary suitable for public reading.
The high-level story:
Beta (June 2026)
- Merlin enters beta. After the invite-only alpha proved the agent loop and the safety gates on real work, Merlin is moving to beta. It is ready for day-to-day use on Apple Silicon macOS first, with Linux to follow and Windows on the roadmap.
- Local memory encrypted at rest. Anything Merlin saves locally (its working memory) is now encrypted on disk under a per-device key held in the OS keychain. On-device by default; a content search still works by decrypting locally.
- Readable, polished desktop. Chat now renders agent replies as Markdown with syntax-highlighted code; body text follows the active theme so light themes stay readable; a stuck-panel-resize bug is fixed; and the Add-Panel picker only offers panels whose plugins are actually installed.
- Curated surfaces, stable contracts. Core / Advanced / Dev tiers keep developer-only commands out of the default help, and the machine-readable contracts (JSON / NDJSON output, exit codes, the fledge-v1 protocol version) are pinned by regression tests with a documented breaking-change policy.
Alpha progress leading into beta (v0.5 → v0.7)
- A Beta-ready reliability gate. merlin confidence --gate distills live-task and benchmark signals into a single Beta-ready verdict, and the Beta baseline is locked in.
- Curated surfaces. The CLI, TUI, and desktop hide developer/experimental surfaces by default (Core / Advanced / Dev tiers): the dev-only bench and confidence commands and internal slash-commands are kept out of the default help.
- Chat-focused desktop. The Chat panel is now conversation-only (you ↔ Merlin); tool and agent activity moved to its own open/close Activity panel, and the screenshot gallery was regenerated for the new layout.
- Frozen contracts. The CLI's --output json / NDJSON shapes, exit codes, and the fledge-v1 protocol version are pinned by tests with a documented breaking-change policy, so integrations stay stable.
- Hardening + provider fixes. SSRF guards on image/web fetches, tightened destructive-op block-lists, the audit-chain key rooted in the OS keychain, an OpenAI tool-array cap fix, text-only provider image staging, and more.
- New tooling. A Gradle/Android build plugin, a drafted Homebrew formula, and the 1.0 distribution matrix + install checklist.
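One way to picture the SSRF guard on image/web fetches listed above is a check that refuses targets resolving to loopback, private, or link-local addresses. The following is a minimal sketch under that assumption; `is_fetch_allowed` and its helpers are hypothetical names, and Merlin's actual guard is not described in this changelog and may work differently.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: returns true when `ip` looks safe to fetch from.
/// Illustrative only; not Merlin's real implementation.
pub fn is_fetch_allowed(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => !is_blocked_v4(v4),
        IpAddr::V6(v6) => !is_blocked_v6(v6),
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()           // 127.0.0.0/8
        || ip.is_private()     // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()  // 169.254/16, which covers cloud metadata endpoints
        || ip.is_unspecified() // 0.0.0.0
        || ip.is_broadcast()   // 255.255.255.255
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) are judged by their IPv4 part.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}
```

A guard like this has to run on the address the hostname actually resolves to, and again after every redirect; checking only the URL string is not enough.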
Current alpha track: app-first Merlin
- Private alpha only: the GitHub Pages site, README, and docs now describe the current Apple Silicon macOS alpha instead of implying a public installer is ready.
- Desktop is the primary product surface: onboarding, Providers & Keys, CLI install, Updates, Projects, and Chat are now the user-facing flow.
- Managed CLI: the app installs ~/.local/bin/merlin as a symlink to the bundled CLI and verifies runtime/plugin health with merlin doctor.
- Secure key management: provider readiness uses Merlin's credential resolver, so env, .env, and OS keychain/keyring sources are reported consistently.
- Next release gate: signed/notarized Apple Silicon DMG, local GUI smoke, fledge lanes run verify, and the macOS release-ready lane.
Sprint 1: Make the agent competent
- Typed tool schemas replace the brittle args: string round-trip with proper JSON Schemas declared per command in plugin.toml.
- Real spec-aware planning: the agent reads relevant specs at task start and injects their constraints into the system prompt.
- Integration test harness with a scripted LLM provider gives the agent loop deterministic regression coverage.
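The scripted-provider idea can be sketched as a stand-in that replays canned replies in order, so the same test input always drives the loop through the same turns. Everything named here (`LlmProvider`, `ScriptedProvider`, `run_until_done`) is hypothetical and chosen for illustration; Merlin's real provider trait and harness may look quite different.

```rust
use std::collections::VecDeque;

/// Hypothetical provider interface, assumed for this sketch.
pub trait LlmProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

/// Replays canned replies in order and records every prompt it receives.
pub struct ScriptedProvider {
    replies: VecDeque<String>,
    pub prompts_seen: Vec<String>,
}

impl ScriptedProvider {
    pub fn new(replies: &[&str]) -> Self {
        Self {
            replies: replies.iter().map(|r| r.to_string()).collect(),
            prompts_seen: Vec::new(),
        }
    }
}

impl LlmProvider for ScriptedProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        self.prompts_seen.push(prompt.to_string());
        self.replies
            .pop_front()
            .ok_or_else(|| "script exhausted".to_string())
    }
}

/// Toy agent loop: calls the provider until it answers "DONE".
/// Returns the number of turns taken, or the provider's error.
pub fn run_until_done(provider: &mut dyn LlmProvider, task: &str) -> Result<usize, String> {
    let mut turns = 0;
    loop {
        turns += 1;
        if provider.complete(task)? == "DONE" {
            return Ok(turns);
        }
    }
}
```

Because the script is finite, a loop that never terminates shows up as a "script exhausted" error instead of a hung test.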
Sprint 2: Make the agent feel alive
- Streaming output: text deltas render incrementally; structured events flush cleanly.
- Cancellation: Ctrl+C returns control immediately, dropping the in-flight LLM call; partial results returned with a cancelled = true flag.
- /model slash command swaps providers mid-session.
- TaskResult.files_changed lists every file the task mutated.
- README rewrite with the real Rust setup.
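The "partial results plus a cancelled flag" behavior can be illustrated with a cooperative cancel check between steps. This is a simplified sketch, not Merlin's code: the `TaskResult` shape and `run_task` are assumptions, and it only checks the flag between steps, whereas dropping an in-flight LLM call would need async cancellation that is not modeled here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical result shape, loosely modeled on the cancelled flag above.
#[derive(Debug, Default, PartialEq)]
pub struct TaskResult {
    pub output: Vec<String>,
    pub cancelled: bool,
}

/// Runs `steps` in order, checking the cancel flag before each one.
/// A Ctrl+C handler (not shown) would set the flag from another thread,
/// typically by sharing the AtomicBool through an Arc.
pub fn run_task(steps: &[&dyn Fn() -> String], cancel: &AtomicBool) -> TaskResult {
    let mut result = TaskResult::default();
    for step in steps {
        if cancel.load(Ordering::SeqCst) {
            result.cancelled = true;
            break; // keep whatever was produced so far
        }
        result.output.push(step());
    }
    result
}
```

The caller gets back everything completed before the flag was seen, so a cancelled task can still report its partial work.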
Sprint 4: Desktop panels + hard benchmarks
- 6 hard benchmark suites (48 tests): advanced_reasoning, code_analysis, agent_tasks, stress_test, expert, architecture. Designed to separate Claude-level models from smaller ones. Tests include formal proofs, constraint satisfaction, security audits, concurrency bugs, and distributed systems design.
- Total: 26 suites, 168 tests covering basic, long-session, adversarial, tool-augmented, refusal, and expert-level evaluation across 30+ providers.
- 7 Ollama Cloud models benchmarked: Devstral 2, Kimi K2.5, Qwen 3.5, Qwen3 Coder, Qwen3 Coder Next, DeepSeek V4 Flash, Gemma 4. Qwen3 Coder Next is the top scorer at 93% on hard suites.
- 6 new desktop panels (19 total):
  - Test Runner: verify lane with per-step pass/fail tracking
  - Log Viewer: buffered, filterable log viewer with severity
  - Spec Viewer: browse module specs with drill-in detail
  - Git: branch status, changed files, branch list
  - Cost Tracker: per-session token usage + USD spend estimate
  - Plugin Manager: view installed plugins, commands, dependencies
Sprint 3: Production-ready providers
- Benchmark system: 7 test suites (32 tests) evaluate provider quality and latency. Results accumulate as JSON history. merlin bench --history shows the scorecard.
- Secure credential storage: merlin keys manages API keys in the OS keychain (macOS Keychain, Linux secret-service, Windows Credential Manager). Resolution chain: env var → .env → keychain.
- 17 pre-configured providers: Anthropic, OpenAI, 5 OpenRouter variants (Sonnet, Haiku, Gemini, DeepSeek, Llama), Groq, Together, 7 Ollama Cloud models. One OPENROUTER_API_KEY covers 5 of them.
- Live streaming validation: 8 integration tests hitting real provider APIs to verify streaming behavior end-to-end.
- Provider health checks: merlin health tests every configured provider with a real API call and reports latency/status.
- Session management: --resume, --sessions, --no-session flags. Sessions auto-cleanup based on configurable TTL.
- Ollama temperature passthrough: temperature parameter now correctly forwarded to the Ollama API.
- Configurable chat path: chat_path field in provider config for non-standard API endpoints.
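The resolution chain (env var → .env → keychain) amounts to an ordered lookup where the first layer holding the key wins. A minimal sketch, assuming each layer can be modeled as a plain map: `Resolver` and `Source` are hypothetical names, and a real resolver would read the process environment, parse a .env file, and query the OS keychain instead.

```rust
use std::collections::HashMap;

/// Where a key was found. Variant order mirrors the lookup order.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Source {
    EnvVar,
    DotEnv,
    Keychain,
}

/// Hypothetical resolver; each layer is a map so the sketch is self-contained.
#[derive(Default)]
pub struct Resolver {
    pub env: HashMap<String, String>,
    pub dotenv: HashMap<String, String>,
    pub keychain: HashMap<String, String>,
}

impl Resolver {
    /// Returns the value from the first layer that defines `name`,
    /// together with the layer it came from.
    pub fn resolve(&self, name: &str) -> Option<(Source, &str)> {
        let layers = [
            (Source::EnvVar, &self.env),
            (Source::DotEnv, &self.dotenv),
            (Source::Keychain, &self.keychain),
        ];
        layers
            .iter()
            .find_map(|&(source, map)| map.get(name).map(|v| (source, v.as_str())))
    }
}
```

Returning the source alongside the value is what lets a readiness report say where each key came from, not only whether one exists.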
Polish pass: CorvidLabs-style
agent.rs, plugin.rs, spec_loader.rs, and output.rs reorganized
with // MARK: - sections and doc comments. Agent::provider_info
returns a named ProviderInfo struct. Fluent setters return &mut Self. Public enums are #[non_exhaustive]. Magic numbers extracted
to constants. Helpers extracted from long methods.
v0.1.0: 2026-05-07
Initial functional release: agent loop, three LLM providers, five internal plugins, memory and AlgoChat adapters, working CLI.
For the full per-line breakdown, see CHANGELOG.md in the
repository.