Changelog

The full, ongoing changelog lives in CHANGELOG.md at the repository root (CorvidLabs internal). What follows is a hand-curated summary of the high-level story, suitable for public reading.

Beta (June 2026)

Merlin enters beta. The invite-only alpha proved the agent loop and the safety gates on real work, so Merlin is moving to beta. It is good enough for real work, on Apple Silicon macOS first, with Linux to follow and Windows on the roadmap.
Local memory encrypted at rest. Anything Merlin saves locally (its working memory) is now encrypted on disk under a per-device key held in the OS keychain. Memory stays on-device by default, and content search still works by decrypting locally.
Readable, polished desktop. Chat now renders agent replies as Markdown with syntax-highlighted code; body text follows the active theme so light themes stay readable; a stuck-panel-resize bug is fixed; and the Add-Panel picker only offers panels whose plugins are actually installed.
Curated surfaces, stable contracts. Core / Advanced / Dev tiers keep developer-only commands out of the default help, and the machine-readable contracts (JSON / NDJSON output, exit codes, the fledge-v1 protocol version) are pinned by regression tests with a documented breaking-change policy.

Alpha progress leading into beta (v0.5 → v0.7)

A Beta-ready reliability gate. merlin confidence --gate distills live-task and benchmark signals into a single Beta-ready verdict, and the Beta baseline is locked in.
Curated surfaces. The CLI, TUI, and desktop hide developer/experimental surfaces by default (Core / Advanced / Dev tiers): the dev-only bench and confidence commands and internal slash-commands are kept out of the default help.
Chat-focused desktop. The Chat panel is now conversation-only (you ↔ Merlin); tool and agent activity moved to its own open/close Activity panel, and the screenshot gallery was regenerated for the new layout.
Frozen contracts. The CLI's --output json / NDJSON shapes, exit codes, and the fledge-v1 protocol version are pinned by tests with a documented breaking-change policy, so integrations stay stable.
Hardening + provider fixes. SSRF guards on image/web fetches, tightened destructive-op block-lists, the audit-chain key rooted in the OS keychain, an OpenAI tool-array cap fix, text-only provider image staging, and more.
New tooling. A Gradle/Android build plugin, a drafted Homebrew formula, and the 1.0 distribution matrix + install checklist.
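One common form of the SSRF guard mentioned under "Hardening + provider fixes" is to refuse fetches whose resolved address is internal. The sketch below is an illustration under that assumption, not Merlin's actual guard; the function name is_blocked_target is hypothetical, and a real guard would also need to handle redirects and DNS rebinding.

```rust
use std::net::IpAddr;

/// Returns true if an already-resolved address should be refused for
/// outbound image/web fetches. Hypothetical helper, not Merlin's API.
pub fn is_blocked_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback()          // 127.0.0.0/8
                || v4.is_private()    // 10/8, 172.16/12, 192.168/16
                || v4.is_link_local() // 169.254/16 (cloud metadata lives here)
                || v4.is_unspecified()
                || v4.is_broadcast()
        }
        IpAddr::V6(v6) => {
            // IPv4-mapped addresses (::ffff:a.b.c.d) are checked as IPv4.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked_target(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()
                || v6.is_unspecified()
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link local, fe80::/10
        }
    }
}
```

The check runs on the resolved IP rather than the hostname, so a public-looking name that resolves to an internal address is still refused.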

Current alpha track: app-first Merlin

Private alpha only: the GitHub Pages site, README, and docs now describe the current Apple Silicon macOS alpha instead of implying a public installer is ready.
Desktop is the primary product surface: onboarding, Providers & Keys, CLI install, Updates, Projects, and Chat are now the user-facing flow.
Managed CLI: the app installs ~/.local/bin/merlin as a symlink to the bundled CLI and verifies runtime/plugin health with merlin doctor.
Secure key management: provider readiness uses Merlin's credential resolver, so env, .env, and OS keychain/keyring sources are reported consistently.
Next release gate: a signed and notarized Apple Silicon DMG, a local GUI smoke test, fledge lanes run verify, and the macOS release-ready lane.

Sprint 1: Make the agent competent

Typed tool schemas replace the brittle args: string round-trip with proper JSON Schemas declared per command in plugin.toml.
Real spec-aware planning: the agent reads relevant specs at task start and injects their constraints into the system prompt.
Integration test harness with a scripted LLM provider gives the agent loop deterministic regression coverage.
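A scripted provider of the kind described for the integration test harness can be sketched as a queue of canned replies behind a provider trait. The names LlmProvider, ScriptedProvider, complete, and prompts_seen are hypothetical and chosen for illustration; Merlin's real trait is not shown in this changelog.

```rust
use std::collections::VecDeque;

/// Hypothetical provider interface; the real trait may differ.
pub trait LlmProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

/// Replays a fixed script of replies so agent-loop tests are deterministic.
pub struct ScriptedProvider {
    replies: VecDeque<String>,
    /// Every prompt received, in order, so tests can assert on them.
    pub prompts_seen: Vec<String>,
}

impl ScriptedProvider {
    pub fn new<I, S>(replies: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            replies: replies.into_iter().map(Into::into).collect(),
            prompts_seen: Vec::new(),
        }
    }
}

impl LlmProvider for ScriptedProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        self.prompts_seen.push(prompt.to_string());
        // Running out of script is an error, which surfaces unexpected extra calls.
        self.replies
            .pop_front()
            .ok_or_else(|| "script exhausted".to_string())
    }
}
```

Because the replies are fixed, a regression test can assert both on the agent's output and on exactly which prompts reached the provider.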

Sprint 2: Make the agent feel alive

Streaming output: text deltas render incrementally; structured events flush cleanly.
Cancellation: Ctrl+C returns control immediately, dropping the in-flight LLM call; partial results are returned with a cancelled = true flag.
/model slash command swaps providers mid-session.
TaskResult.files_changed lists every file the task mutated.
README rewrite with the real Rust setup.
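The cancellation and files_changed entries above suggest a result shape along the following lines. This is a minimal sketch that swaps in a cooperative cancel flag checked between steps, rather than dropping an in-flight call as Merlin does. Only the names TaskResult, files_changed, and cancelled come from the changelog; the output field, run_steps, and the step tuple are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical result shape; only `files_changed` and `cancelled`
/// are named in the changelog.
#[derive(Debug, Default, PartialEq)]
pub struct TaskResult {
    pub output: String,
    pub files_changed: Vec<String>,
    pub cancelled: bool,
}

/// Runs steps until they are exhausted or `cancel` is set (for example by
/// a Ctrl+C handler). Each step is (text produced, optional file mutated).
pub fn run_steps<'a, I>(steps: I, cancel: &AtomicBool) -> TaskResult
where
    I: IntoIterator<Item = (&'a str, Option<&'a str>)>,
{
    let mut result = TaskResult::default();
    for (text, changed_file) in steps {
        // Stop before doing more work; keep whatever has accumulated so far.
        if cancel.load(Ordering::SeqCst) {
            result.cancelled = true;
            break;
        }
        result.output.push_str(text);
        if let Some(path) = changed_file {
            result.files_changed.push(path.to_string());
        }
    }
    result
}
```

A caller can then distinguish a completed task from an interrupted one by checking cancelled, while still seeing the partial output and the files touched before the interruption.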

Sprint 4: Desktop panels + hard benchmarks

6 hard benchmark suites (48 tests): advanced_reasoning, code_analysis, agent_tasks, stress_test, expert, architecture. Designed to separate Claude-level models from smaller ones. Tests include formal proofs, constraint satisfaction, security audits, concurrency bugs, and distributed systems design.
Total: 26 suites, 168 tests covering basic, long-session, adversarial, tool-augmented, refusal, and expert-level evaluation across 30+ providers.
7 Ollama Cloud models benchmarked: Devstral 2, Kimi K2.5, Qwen 3.5, Qwen3 Coder, Qwen3 Coder Next, DeepSeek V4 Flash, Gemma 4. Qwen3 Coder Next is the top scorer at 93% on hard suites.
6 new desktop panels (19 total):
- Test Runner: verify lane with per-step pass/fail tracking
- Log Viewer: buffered, filterable log viewer with severity
- Spec Viewer: browse module specs with drill-in detail
- Git: branch status, changed files, branch list
- Cost Tracker: per-session token usage + USD spend estimate
- Plugin Manager: view installed plugins, commands, dependencies
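The Cost Tracker's per-session USD estimate comes down to simple arithmetic over token counts. A minimal sketch, assuming per-million-token pricing; the types Pricing and SessionUsage and the prices used in the usage note are hypothetical, and real prices vary by provider and model.

```rust
/// Hypothetical per-million-token prices in USD.
pub struct Pricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

/// Running token totals for one session.
#[derive(Debug, Default)]
pub struct SessionUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

impl SessionUsage {
    /// Adds the token counts reported for one request.
    pub fn record(&mut self, input: u64, output: u64) {
        self.input_tokens += input;
        self.output_tokens += output;
    }

    /// Estimated spend: tokens times price, scaled from per-million pricing.
    pub fn estimated_usd(&self, pricing: &Pricing) -> f64 {
        (self.input_tokens as f64 * pricing.input_per_mtok
            + self.output_tokens as f64 * pricing.output_per_mtok)
            / 1_000_000.0
    }
}
```

With illustrative prices of 3.00 USD per million input tokens and 15.00 USD per million output tokens, a session with 1,000,000 input and 500,000 output tokens estimates to 3.00 + 7.50 = 10.50 USD.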

Sprint 3: Production-ready providers

Benchmark system: 7 test suites (32 tests) evaluate provider quality and latency. Results accumulate as JSON history. merlin bench --history shows the scorecard.
Secure credential storage: merlin keys manages API keys in the OS keychain (macOS Keychain, Linux secret-service, Windows Credential Manager). Resolution chain: env var → .env → keychain.
17 pre-configured providers: Anthropic, OpenAI, 5 OpenRouter variants (Sonnet, Haiku, Gemini, DeepSeek, Llama), Groq, Together, 7 Ollama Cloud models. One OPENROUTER_API_KEY covers 5 of them.
Live streaming validation: 8 integration tests hitting real provider APIs to verify streaming behavior end-to-end.
Provider health checks: merlin health tests every configured provider with a real API call and reports latency/status.
Session management: --resume, --sessions, --no-session flags. Sessions are cleaned up automatically based on a configurable TTL.
Ollama temperature passthrough: temperature parameter now correctly forwarded to the Ollama API.
Configurable chat path: chat_path field in provider config for non-standard API endpoints.
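The credential resolution chain listed under "Secure credential storage" (env var → .env → keychain) can be sketched as an ordered lookup where the first source holding a non-empty value wins. This is an illustration, not Merlin's resolver: the Source enum and resolve function are hypothetical, plain maps stand in for the real environment, .env file, and OS keychain, and skipping empty values is an assumption.

```rust
use std::collections::HashMap;

/// Where a credential was found. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    Env,
    DotEnv,
    Keychain,
}

/// Checks each source in priority order and returns the first hit,
/// together with the source that supplied it.
pub fn resolve(
    name: &str,
    env: &HashMap<String, String>,
    dotenv: &HashMap<String, String>,
    keychain: &HashMap<String, String>,
) -> Option<(Source, String)> {
    let chain = [
        (Source::Env, env),
        (Source::DotEnv, dotenv),
        (Source::Keychain, keychain),
    ];
    chain.iter().find_map(|(source, store)| {
        store
            .get(name)
            .filter(|value| !value.is_empty()) // assumed: empty counts as missing
            .map(|value| (*source, value.clone()))
    })
}
```

Returning the source alongside the value is what lets a readiness report state where each key came from, not just whether it exists.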

Polish pass: CorvidLabs-style

agent.rs, plugin.rs, spec_loader.rs, and output.rs reorganized with // MARK: - sections and doc comments. Agent::provider_info returns a named ProviderInfo struct. Fluent setters return &mut Self. Public enums are #[non_exhaustive]. Magic numbers extracted to constants. Helpers extracted from long methods.
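The conventions from the polish pass look roughly like this when combined. Only Agent::provider_info and ProviderInfo are named in the changelog; the fields, the Mode enum, the setters, and DEFAULT_MAX_STEPS are hypothetical stand-ins.

```rust
// MARK: - Constants

/// Extracted from a former magic number (value is illustrative).
const DEFAULT_MAX_STEPS: usize = 32;

// MARK: - Types

/// Named struct returned instead of an anonymous tuple.
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderInfo {
    pub name: String,
    pub model: String,
}

/// `#[non_exhaustive]` lets new variants be added without breaking
/// downstream `match` expressions.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    Plan,
    Execute,
}

pub struct Agent {
    provider: String,
    model: String,
    mode: Mode,
    max_steps: usize,
}

// MARK: - Agent

impl Agent {
    pub fn new(provider: &str, model: &str) -> Self {
        Self {
            provider: provider.to_string(),
            model: model.to_string(),
            mode: Mode::Plan,
            max_steps: DEFAULT_MAX_STEPS,
        }
    }

    /// Fluent setter: returns `&mut Self` so calls can be chained.
    pub fn set_model(&mut self, model: &str) -> &mut Self {
        self.model = model.to_string();
        self
    }

    pub fn set_mode(&mut self, mode: Mode) -> &mut Self {
        self.mode = mode;
        self
    }

    pub fn set_max_steps(&mut self, max_steps: usize) -> &mut Self {
        self.max_steps = max_steps;
        self
    }

    pub fn mode(&self) -> Mode {
        self.mode
    }

    pub fn max_steps(&self) -> usize {
        self.max_steps
    }

    pub fn provider_info(&self) -> ProviderInfo {
        ProviderInfo {
            name: self.provider.clone(),
            model: self.model.clone(),
        }
    }
}
```

Returning &mut Self keeps the agent owned by the caller, so a chain such as agent.set_model("model-b").set_max_steps(8) mutates in place without moving the value.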

v0.1.0: 2026-05-07

Initial functional release: agent loop, three LLM providers, five internal plugins, memory and AlgoChat adapters, working CLI.

For the full per-line breakdown, see CHANGELOG.md in the repository.