Bitter Frontier

The Policy You Wrote Wasn't the Policy You Had

Wed, 03 Jun 2026 00:00:00 GMT

The Policy You Wrote Wasn't the Policy You Had

Seven days, ten providers, one uncomfortable theme: the headline this week is not new capability. It is the gap between the policy an operator configured and the policy the runtime actually enforced -- and how many providers spent the window quietly closing it.

A Claude Code operator who wrote a Read-deny rule to hide a secret file was still leaking it through Glob and Grep. A Pi user authenticating against an OAuth server could be handed a verification URI that ran shell commands. A Hermes Docker dashboard could drop its auth because a heuristic misread the bind host. A Gemini CLI MCP blacklist could be bypassed. None of these were the operator's misconfiguration. The rules were written; the enforcement silently wasn't there. This week, across Claude Code, Gemini CLI, Pi, OpenHands, Hermes, and Flue, the same class of fix landed: restore the enforcement the operator already believed was in place.

The quieter, more forward-looking thread is the inverse of a gap-close: skills and plugins became governed, auditable, sometimes agent-activated resources across four providers in parallel -- Paperclip, OpenClaw, Flue, and Agent Zero. Capability that used to be an ambient default is becoming reviewable operating state.

Security Advisories: Check These Before Upgrading

Claude Code 2.1.160--2.1.162: three permission-bypass gaps closed at once. Custom WebFetch permission rules now override the built-in preapproved-domain whitelist; Windows permission rules with backslashes or case-variant paths now match; and Read-deny rules now hide files from Glob and Grep results. The sharpest of the three is the last one: a file an operator denied for Read was still discoverable -- path and contents surfaceable -- through search tools, defeating the access-control intent. The attacker model is prompt-injection or compromised task content steering the agent toward a denied domain or walled-off path; the fix is gated purely on upgrading, so the operator action is upgrade, then re-audit whether any policy was silently bypassed in the prior window, especially on Windows and any setup relying on Read-deny to hide secrets from search. The changelog ships this as an ordinary entry; treat it as the advisory it is.

Claude Code 2.1.160: execution-granting config writes now prompt even in acceptEdits mode. Two guardrails land together. acceptEdits mode now prompts before writing build-tool config that grants code execution (.npmrc, .yarnrc*, bunfig.toml, .bazelrc, .pre-commit-config.yaml, .devcontainer/), and the agent now prompts before writing shell startup files (.zshenv, .zlogin, .bash_login) and ~/.config/git/. Operators running acceptEdits or auto-leaning modes previously had a silent write path into files that execute on the next shell login, install, or commit -- the classic agent-persistence and supply-chain escalation vector. The prompt is the guardrail here; blanket- allowing it puts you back where you started.

Pi: OAuth command injection and git-package path traversal closed. Commit ba6e529 validates OAuth verification URIs (rejecting non-HTTP(S) schemes) and launches the browser via spawn() instead of shell exec(), closing a path where a malicious OAuth server could inject $(id>/tmp/pwned)-style commands. Commit a98e087 rejects git URLs with .., null bytes, backslashes, or leading slashes at both parse and resolution time, blocking writes outside the package install root. The attacker is whoever controls the OAuth server or authors the git package; both fixes need no config change, only the upgrade.

OpenHands main: three named CVEs. No tagged release fell in the window, but main closed CVE-2026-44492 (axios 1.16.0), CVE-2026-41238 (dompurify 3.4.0), and CVE-2026-42305 (dulwich 1.2.5). The first two are browser-facing (HTTP client and HTML/DOM sanitizer) and need a frontend rebuild and redeploy; the third is a backend git library and needs a poetry.lock re-resolve and image rebuild. Self-hosters pinning older lockfiles must bump manually.

Gemini CLI v0.45.0: MCP blacklist bypass fixed. The stable release bundles Termux relaunch/resize fixes, session-context filtering on history resume, and -- the security-bearing item -- a fix for a path where a blacklisted MCP tool or server could still be reached. Operators relying on MCP deny-lists for containment should upgrade before trusting the blacklist, and test that blacklisted tools are actually unreachable rather than assume full coverage.

Hermes v0.15.1: Docker insecure binding is now an explicit opt-in. The dashboard no longer infers insecure mode from the bind host; it requires HERMES_DASHBOARD_INSECURE=1 explicitly. This removes a silent path where a misread bind host dropped auth and exposed the dashboard to a network-adjacent attacker. Existing Docker and hosted setups must update env config before upgrading. The same patch fixes a v0.15.0 loopback-mode dashboard reload loop and restores MCP bare-command resolution (npx, npm, node) in Docker.

Paperclip v2026.529.0: first-admin claim is now the bootstrap gate. Unclaimed self-hosted deployments get a one-time browser claim to create the first admin. The flip side is a race: whoever completes the claim first becomes admin, so an attacker with network reach to a freshly stood-up instance could seize control before the legitimate operator. Claim promptly and restrict network exposure during the unclaimed window.

Hermes v0.15.0: Promptware defense, and a migration. The Velocity Release adds a built-in defense against Brainworm-class prompt-injection and closes 19 security-tagged issues. Operators running against untrusted content (web, repos, MCP output) should validate the defense against their own injection corpus rather than assume blanket coverage; novel vectors outside the known class may still pass.

Flue v0.9.0: a hard breaking migration. Routing imports move from @flue/runtime/app to @flue/runtime/routing, provider model values now require provider-id/model-id format, SDK mount paths derive from baseUrl, and persisted beta session state is rejected -- clear or migrate the store before upgrading or sessions fail to restore. Cloudflare Durable Object migrations are no longer auto-appended; the operator now owns them in the Wrangler config, and interrupted workflows no longer auto-retry.

The Enforcement Gap, Six Ways

The thread that cuts across the watchlist is consistent enough to name plainly. In each case, a control the operator had reason to believe was active was not -- and the fix is the same shape: make the enforcement match the configuration.

The Claude Code cluster is the clearest statement of it. A Read-deny rule that didn't hide files from Glob/Grep, a WebFetch rule that didn't override the preapproved-domain list, and Windows path rules that silently didn't match on case or separator variance are three independent ways the same promise -- "the policy I wrote is enforced" -- was broken. The same release line also converts a silent config-write into a confirmation checkpoint for files that grant code execution, and corrects an over-broad managed-settings policy that was wrongly blocking legitimate third-party provider sessions.

Gemini CLI's MCP blacklist bypass is the same bug class at the tool layer: a deny-list that didn't deny. Its companion policy-file resilience fix closes a fail-open gap where a policy file that failed to persist (on cross-device container mounts) or failed to parse (corrupt TOML) could leave the agent running without the operator's intended policy in effect. Recovery now writes a .bak and rebuilds to defaults, so a corrupted policy is silently reset -- re-verify intended policy after a .bak appears.

Pi's quartet of hardening commits -- OAuth injection, git path traversal, auth files created at 0o600 instead of briefly world-readable, and extension cache moved out of world-accessible /tmp -- is the multi-user-host version of the same theme: close the windows where a control was assumed but a co-tenant could slip through. Flue's v0.9.1 WebSocket credential hardening strips query strings and fragments before persisting Cloudflare attachments so URL-carried handshake credentials are not retained, and OpenHands moved ACP provider credentials off the plaintext acp_env channel onto an encrypted secrets channel. And Hermes closed the same shape at the deployment edge: a Docker dashboard that silently dropped auth when a heuristic misread the bind host now demands an explicit HERMES_DASHBOARD_INSECURE=1.

The operator takeaway is uncomfortable but actionable: an upgrade is not just a feature bump this week. For every provider above, the safe assumption is that some control you configured on the prior build was not holding, and the post-upgrade action is a re-audit, not a victory lap.

Skills and Plugins Become Governed State

The second thread runs the other way. Four providers, four surfaces, shipped the same move: agent capability stops being an ambient default and becomes reviewable, sometimes approvable, operating state.

Paperclip made company skills first-class resources with an install / reset / audit / export / assign CLI. The load-bearing verbs are audit and export -- which skills an agent holds becomes a queryable, exportable fact rather than implicit config -- and assign, which makes a capability grant a distinct, reviewable authority action.

OpenClaw's Skill Workshop inserts a human-in-the-loop gate: new skills enter a pending-proposal queue reviewed via CLI or Gateway before taking effect. A new skill_workshop agent tool lets agents file proposals themselves, which widens the surface proposals originate from -- so the operator decision is who may review and who may self-approve. Lax review re-opens the unreviewed-skill path.

Flue v0.9.2 went the other direction on activation authority: an activate_skill tool lets agents load full skill instructions on demand before matching work. The operator's visible control narrows to which skills are configured; the choice to activate moves to the agent. Workspace skills are reread on activation, so mid-session edits take effect.

Agent Zero v1.19 made Office, Desktop, and Editor plugins toggleable behind a protected plugin-state API -- a real authority lever that lets an operator disable powerful capabilities (Desktop computer-use especially) on deployments that should not have them. The release note describes a "protected" toggle endpoint but no auth model or role-based capability management, so treat it as a disable lever, not yet an audited capability register.

The shapes differ -- catalog audit, proposal approval, agent self-activation, capability toggle -- but the direction is one: the question "what can this agent do?" is becoming answerable by inspecting state rather than reading code or trusting defaults.

The accessibility read is mixed. Claude Code's waitingFor field and fan-out progress counter make agent state legible to operators who previously had to open each session; Flue v0.9.0, by contrast, forces a hard migration with no automated path, raising rather than lowering the cost of staying current. The week made harnesses more governable, not more reachable.

Control Plane

Control plane saw the most movement, in two directions. The governance-of-capability cluster above (Paperclip skills, OpenClaw Skill Workshop, Agent Zero plugin toggles, Flue agent-activated skills) sits here, as does a steady relocation of authority onto standing credentials and cloud paths: Codex remote-exec API-key host registration, Hermes Bitwarden Secrets Manager replacing per-provider keys, Claude Code Auto Mode reaching Bedrock/Vertex/Foundry, and Codex models running under AWS IAM via Bedrock. Claude Code also made agent supervision more legible: claude agents --json now exposes a waitingFor field naming what a blocked session waits on (e.g. a permission prompt), plus a done/total fan-out progress counter.

Runtime

Runtime carried most of the enforcement-gap closures -- the Claude Code config-write prompts, Pi's OAuth and path-traversal fixes, Flue's WebSocket credential stripping, Hermes's Promptware defense, OpenHands's dulwich CVE -- plus one notable posture reversal: Agent Zero reverted computer-use screenshots to durable chat-scoped storage, undoing its prior ephemeral-by-default stance. That improves audit trails but persists potentially sensitive on-screen content (credentials, PII, internal UIs) with no automatic redaction -- a data-at-rest exposure operators must scope and prune.

Platform

Platform was mostly steady-state plumbing: the OpenHands frontend CVE cluster, Gemini CLI's v0.45.0 stable bundle and an editor-spam-loop fix, OpenClaw's MiniMax M3 model support, and Flue's OpenTelemetry tracing package. Codex's Sites plugin -- in-app website/web-app creation and deployment, included by default in Business workspaces -- is the one platform item with a governance edge: a deploy capability may already be active without an explicit enablement step.

Provider Notes

Codex (CLI 0.135.0--0.136.0, iOS 1.2026.146) shipped named permission profiles with custom-config display and codex doctor diagnostics (0.135.0), a non-interactive installer for CI, plus remote-exec API-key host registration and thread archiving (0.136.0). The iOS app added an optional Face ID / passcode lock for Codex and SSH-to-Windows. Two integrations landed: the Sites plugin and Amazon Bedrock under AWS-managed auth and billing.

Claude Code (2.1.158--2.1.162) is the enforcement-gap headliner: the permission/deny-rule cluster, execution-granting config-write prompts, the managed-settings third-party-session unblock, agent-status observability, and Auto Mode reaching Bedrock/Vertex/Foundry for Opus 4.7/4.8.

Gemini CLI (v0.44.1--v0.46.0-preview) shipped the v0.45.0 stable bundle with the MCP blacklist fix and Termux hardening, policy-file resilience, and a server-flag-gated Gemini 3.5 Flash GA rollout that decouples model-in-use from client version. A CI change to pull_request_target on the PR-size labeler is low-risk as written (it only reads line counts) but removes the structural safety of pull_request -- any future edit adding fork-code checkout becomes immediately dangerous.

Hermes Agent (v0.15.0--v0.15.1 + post-release commits) is the Velocity Release: a 76% run_agent.py refactor, Kanban evolving into a multi-agent orchestration platform with auto-decomposition, swarm topology, and worktree-per-task, Promptware defense, and Bitwarden Secrets Manager. The v0.15.1 patch fixes the Docker insecure-binding opt-in and a dashboard reload loop; June 3 commit waves hardened installer self-update, Windows/WSL2 PTY and schtasks handling, and desktop session management.

Pi coding agent (commits to main) shipped a security-hardening cluster: OAuth launch hardening, git path-traversal rejection, auth-file mode-on-create (0o600 instead of briefly world-readable), extension-cache isolation out of world-accessible /tmp, and HTML-export XSS sanitization. Alongside, model-catalog maintenance removed stale Codex entries, added Mistral Devstral 2 and Open Mistral Nemo, and refreshed Claude model pricing and token caps to 128k output. No reliably in-window tagged release landed; the security work shipped as commits to main.

OpenClaw (2026.5.31-beta.3 through 2026.6.1 stable) shipped the Skill Workshop proposal workflow, interrupted-tool-call recovery, bounded request timers (re-evaluate SLOs), enhanced plugin isolation, MiniMax M3, and Tailscale Serve service-name binding with SQLite-backed state migration for iMessage and plugin-install tracking.

Paperclip (v2026.529.0) shipped the skills CLI/catalog, the first-admin claim flow, inline document annotations, per-user sidebar controls, and live Claude model discovery from the UI.

Agent Zero (v1.19) renamed Remote Link to Remote Control with selectable tunnel providers, made Office/Desktop/Editor plugins toggleable behind a protected API, reverted screenshots to durable chat-scoped storage, unified OAuth account management, and hardened Xpra desktop control.

OpenHands (main, no tagged release) shipped the three-CVE remediation cluster (axios, dompurify, dulwich), the ACP-credentials-to-secrets-channel move, a cascade-delete-sole-org-requester change on DELETE /api/organizations (org deletion now also deletes the requesting user if it is their only org), a git-proxy capability, and a LiteLLM 1.84.1 upgrade.

Flue (Tier 2; v0.8.1--v0.9.2) shipped OpenTelemetry tracing, the v0.9.0 breaking migration, WebSocket credential hardening, operator-owned workflow-run retention (the implicit 50-run prune is gone), and autonomous activate_skill.

What To Try

Claude Code operators: upgrade to 2.1.162 and re-audit any allow/deny or Read-deny policy that ran on older builds. Then wire waitingFor and the fan-out progress counter into supervision tooling so stuck-agent triage stops requiring a human to open each session.
Paperclip operators: use the skills CLI to audit and export which agents hold which skills, and claim any freshly stood-up self-hosted instance immediately.
Codex operators on iOS: enable the Face ID / passcode lock before treating mobile as a trusted access surface.
Hermes operators: queue a decomposable task on the new multi-agent Kanban and confirm the orchestrator spawns the expected sub-agents in isolated worktrees before trusting it with real work. Set HERMES_DASHBOARD_INSECURE=1 only where insecure binding is genuinely intended.
Agent Zero operators: disable the Office/Desktop/Editor plugins you do not need, and review retention/access controls for the now-durable computer-use screenshots before capturing sensitive screens.
Gemini CLI maintainers: review the pull_request_target labeler workflow to confirm it only reads PR metadata and never checks out fork code under the elevated token.

What Remains Uncertain

Codex remote-exec key lifecycle: scope, rotation, and revocation for the approved-host API-key registration are undocumented. Whether a leaked key grants persistent remote exec is unverified.
Codex iOS SSH trust handling: host-key verification, key storage, and scoping of the iOS SSH-to-Windows client are not described.
Gemini CLI model routing: with Flash GA gated server-side, the model in use is no longer determined by client version alone -- backend flag state is now part of the audit surface.
OpenClaw plugin-isolation depth: the release note asserts tighter isolation but does not describe the boundary's depth, so operators cannot verify it from the receipt.
Hermes Promptware coverage: the defense targets a known attack class; novel injection vectors outside Brainworm patterns may still pass. Validate against your own corpus.
Flue persisted-session migration: v0.9.0 rejects pre-upgrade session state with no automated migration path; a self-scripted migration could reintroduce stale, unredacted state.
OpenHands org-deletion blast radius: operators on the cascade-delete change should enforce backups before any DELETE /api/organizations, since a sole-org delete now removes the user identity too.

Auto Stops Asking

Wed, 27 May 2026 00:00:00 GMT

Auto Stops Asking

Fifteen days, ten providers, one direction. The change that cuts across the watchlist this fortnight is uncomfortable to ignore: autonomy stopped asking for permission.

Claude Code 2.1.152 flipped Auto mode from opt-in to default. Codex 26.519 graduated goal mode out of experimental and turned it on by default across app, IDE, and CLI. Gemini CLI v0.44.0 collapsed multiple Auto variants into a single mode and added shell-redirect auto-approval in AUTO_EDIT. Three providers, three surfaces, one shape: the permission ceremony that used to gate productive autonomy is no longer the default surface. Operators don't choose to enable autonomy; they decide how to constrain it.

The other half of the fortnight is the policy substrate that move requires. Codex CLI 0.133.0 shipped permission profile inheritance and a managed requirements.toml enforcement file consulted by the runtime. Gemini CLI integrated PolicyEngine into ACP sessions, reaching enforcement into the protocol layer. OpenHands shipped org-level LLM profiles with two-tier permissions and concurrency-safe activation. Three different products, three different surfaces, one direction: policy lives in versioned, org-managed files now — not in per-session flags.

These themes are not independent. Autonomy moving from opt-in to baseline makes per-session permission grants intractable. The policy file is the correct primitive when the operator's decision is "constrain the baseline" rather than "consent to each escalation."

Breaking Changes: Check These Before Upgrading

Claude Code v2.1.149: a PowerShell permission bypass and a worktree sandbox scope bug. Windows operators with PowerShell allowlists are affected by PowerShell built-in cd functions (cd.., cd\, cd~, X:) defeating the workspace boundary undetected. Git worktree workflows are affected by the sandbox write allowlist over-scoping the main repository root instead of the shared .git directory. Anthropic ships these as ordinary changelog entries; the changelog is the de-facto advisory surface, but no separate page exists. Upgrade past 2.1.149 before deploying. v2.1.147 closes adjacent forceLoginOrgUUID and forceLoginMethod enforcement gaps against third-party-provider and API-key sessions; v2.1.148 closes a Vertex AI provider bypass.

Claude Code v2.1.152: Auto mode no longer requires opt-in consent. Auto mode — the permission classifier that runs safe actions without prompting and blocks risky ones — is now the default permission posture across the install base. Admins relying on the consent dialog as a visible posture check have lost that surface. Re-audit managed settings and decide where the equivalent check now lives.

Codex CLI 0.134.0: legacy profile configs rejected with migration guidance. --profile is the canonical permission selector across CLI, TUI, and sandbox flows. Scripts using older permission flag-soup must migrate before upgrade.

OpenHands main (pre-2026-05-22 SaaS deployments): MCP server and acp_env cross-org credential leak. Before PR #14528, MCP server configurations added by an org member were broadcast to every other member's row. The fix splits agent settings into shared and private halves and strips legacy leaked values on read. Multi-tenant SaaS operators on pre-fix deployments should rotate MCP credentials added before that date and confirm they are on a post-fix main build (no in-window tagged release yet).

Hermes Agent v0.14.0: PyPI distribution, lazy adapter install, and the proxy. Installation moves to pip install hermes-agent; the [all] extras are removed in favor of lazy install of heavy adapters on first use. Cold-start drops ~19s. A native Windows beta ships. The hermes proxy command exposes a local OpenAI-compatible endpoint backed by whichever OAuth provider the operator is signed into. The PR body does not specify the proxy's bind address or auth model; default-loopback-only is the safe assumption to verify, not assume.

Autonomy Stops Asking

Three providers shipped default-on autonomy in the same fortnight, and the framing is consistent enough to deserve its own paragraph.

Claude Code's Auto mode was the explicit feature. Until 2.1.152 it required consent — operators clicked through a dialog to enable it. Now it is the default. Auto mode selectively runs safe actions without prompting and blocks risky ones via a classifier; the classification is runtime-defined, not enumerated in docs. The same release adds disallowed-tools in skill and slash-command frontmatter (a skill can subtract from the agent's tool surface) and a MessageDisplay hook event that can transform or hide assistant message text on the output path. Skill authors get a way to scope down; hook authors get a new vector to filter what operators see.

Codex's goal mode is the long-horizon variant. The 26.519 product launch graduates it out of experimental across the app, IDE extension, and CLI; CLI 0.133.0 turns goals on by default with dedicated storage and progress tracking across active turns. Operators can point Codex at an objective spanning "hours or even days." Same launch ships remote computer use after Mac lock with documented safeguards: short-lived authorization, covered displays, automatic relock on local input, manual unlock fallback. The locked-host computer-use surface is gated, but the gates are policy choices, not absent capability.

Gemini CLI's Auto modes merged. The prior fan of Auto variants collapses to one. The release frames this as UX simplification; in practice it collapses whatever differentiation the variants carried. v0.44.0 stable adds shell-redirect auto-approval in AUTO_EDIT — described as quality-of-life and also an attack-surface expansion if the agent is steered toward sensitive write paths.

Operators who never enabled Auto mode now get its productivity benefit without ceremony. Operators who used the consent dialog as a manual sanity check before risky actions must build that check elsewhere — managed settings, hook policy, or out-of-band review. The accessibility win and the authority-visibility cost arrive together; the RESEARCH_CONTRACT calls this the cross-axis tension, and it is the shape of every default-on change this fortnight.

Policy Moves Into Versioned Files

The other half of the move is structural. If autonomy is the baseline and the operator decision is constraint, then per-session flags are the wrong surface. Three providers shipped, in the same fortnight, the same answer: policy lives in versioned, org-managed files consulted by the runtime.

Codex CLI 0.133.0 added permission profile inheritance — a profile can derive from another, layering changes on top of a base instead of redeclaring every grant. Managed requirements.toml integration is the org-level enforcement surface; the release describes it as enforcement, not advice. Runtime refresh lets profiles update without restart. CLI 0.134.0 then made --profile the canonical selector across the CLI, TUI permission flows, and sandbox flows, rejecting legacy configs with migration guidance.

Gemini CLI v0.44.0 integrated PolicyEngine into ACP (Agent Communication Protocol) sessions (PR #27252) — framed as a deadlock fix, but the effect is policy enforcement at the protocol-session layer, not just at the shell-tool layer. The "deadlock fix" framing understates the structural shift: enforcement now reaches into the ACP layer the docs name explicitly as the delegation primitive.

OpenHands added organization-level LLM profile storage in SaaS mode (PR #14406). Migration 116 adds an encrypted llm_profiles JSON column on the org table; six CRUD endpoints sit under /api/organizations/{org_id}/profiles. Permissions are two-tier: VIEW_ORG_SETTINGS for read; EDIT_ORG_SETTINGS for create / update / delete / rename / activate. Activate is the bigger surface; the same transaction updates the org's profiles.active and the acting member's agent_settings_diff, with SELECT ... FOR UPDATE serializing concurrent writes.

For enterprise operators, the practical implication is the same across all three: stop maintaining flat policy in per-session flags. Build a base policy (Codex profile, Gemini policy file, OpenHands org LLM profile) and derive per-team variations. The runtime now treats the file as the source of truth.

The distribution and signing model for these files is not yet fully documented in any of the three. That is the next thing to watch.

Authority Over Inputs, Three Surfaces

The third theme is quieter but the strongest single thread of the fortnight. Three providers shipped, through three very different surfaces, the same primitive: structural authority over what the agent or its inputs can do.

OpenClaw (v2026.5.26) hardened the inbound-sender layer. ClickClack allowFrom sender allowlists run before agent dispatch, not as post-dispatch blocking. Browser snapshot reads honor SSRF policy before reading tab URLs. Queued system-event text is sanitized so untrusted plugin or channel labels cannot spoof nested prompt markers. Memory store gets a separate prompt-like-text reject filter. Tool-call serializations are scrubbed from replies. The pattern: deny unauthorized senders the chance to influence agent behavior at all, rather than blocking specific actions after the agent has been biased.

Agent Zero (v1.17) hardened the host-runtime layer. The new computer_use_remote tool controls the operator's actual desktop — outside the Docker/Xpra container — with platform-specific structural targeting (macOS Accessibility / Windows UIA / Linux AT-SPI). Every state-changing action is treated as unverified until a fresh screenshot visibly confirms the outcome. Agents must stop when no screenshot is available. macOS approval denials route to a re-arm-required stop flow rather than silent retry. v1.16 made screenshot capture ephemeral and context-scoped by default — captures route through in-process image refs rather than disk, so the agent no longer leaves screenshot trails by default.

OpenHands (PR #14528) hardened the org-member layer. Before the fix, MCP server and acp_env configurations added by one org member were broadcast to every other member's row. The fix splits agent settings into a shared half and a private half; private keys go only to the acting member's row. The fix also strips legacy leaked values on read so pre-fix data stops contaminating after upgrade.

Three providers, three surfaces, one primitive: authority over inputs applied at the layer the input enters. The shapes are different — allowlist, vision verification, per-member private settings — but the principle is the same. Inputs cross trust boundaries with explicit structural gates, not by prompt discipline.

Provider Notes

Codex (26.519, CLI 0.131--0.134) shipped goal-mode graduation, remote computer use after Mac lock, Appshots, plugin marketplace sharing, profile inheritance, managed requirements.toml, codex doctor diagnostics, Python SDK first-class authentication, codex exec resume --output-schema, conversation history search, and read-only MCP concurrency via readOnlyHint. The product launch and the CLI minor releases are tightly coordinated; goal-mode graduation and CLI default-on landed the same day.

Gemini CLI (v0.44.0) shipped stable LocalSessionInvocation / RemoteSessionInvocation protocols (closing the "tests but no observed remote target" gap on the prior AgentProtocol), first-wins prioritize-project agent registration, OAuth refresh preservation during rotation, keychain auth for --list-sessions and non-interactive mode, and MCP OAuth token refresh on re-authentication. Two weeks of What's-New digests (Weeks 21--22) are not yet published; the changelog and release notes are the trailing surface.

OpenHands (main branch, no tagged release in window) shipped the ACP agent settings UI, organization-level LLM profiles, scoped MCP/ACP env to acting org members, Azure DevOps via Microsoft Entra ID OAuth/OIDC, Bitbucket DC and Jira DC integrations with KOTS-managed service accounts, and a batched CVE remediation cluster (9+ deps). The shape is consolidation as the enterprise-self-hosted shell around third-party agents and Data Center source control.

Agent Zero (v1.15--v1.18) shipped host-machine desktop control with vision verification, ephemeral context-scoped capture by default, speech as independent built-in plugins (breaking removal of legacy APIs), document_artifact → office_artifact rename, dedicated Markdown editor plugin, file-browser routing formalization, configurable max_active_skills, MCP multimodal content handling fix, and skill visibility controls (operators can hide skills from the model-facing catalog).

OpenClaw (v2026.5.18--v2026.5.26) shipped the content-boundary hardening suite, transcripts promoted to a core source-provider path with Meeting Notes plugin, reaction-based approvals across Signal / iMessage / WhatsApp, named model login profiles with credential migrations for Hermes / OpenCode / Codex, realtime Talk inspectable / steerable / cancellable across Web UI and Discord voice, on-by-default gateway auth rate-limiter for unset gateway.auth.rateLimit, and release verification stanzas with full CI run URLs and evidence manifests.

Hermes Agent (v0.14.0) is the Foundation Release: PyPI distribution, lazy adapter install with supply-chain advisory checker, native Windows beta, Zed ACP Registry listing, the OpenAI-compatible local hermes proxy, Honcho identity-mapping with peer-id in cache signatures, isolated credential pool on provider fallback, and a sustained fix(kanban) corruption-hardening wave post-release.

Paperclip (v2026.513, v2026.517, v2026.525) shipped scoped agent permissions and protected assignments via a real authorization service, routine env secrets with agent < project < routine precedence, board-managed document locks, Modal as a first-party sandbox plugin, and an ACPX-Claude adapter that resolves bare Claude model IDs, surfaces real diagnostic detail, and respects user ~/.claude/settings.json permissions.

Pi coding agent (v0.74.1--v0.76.0) shipped supply-chain hardening (npm shrinkwrap, lifecycle-script controls, isolated install smoke tests), --session-id explicit session naming and excludeFromContext flag for the bash RPC, plus provider retry and timeout bounds. Supply-chain posture lands the same fortnight as Hermes's lazy-install advisory work — two different providers converging on the same hygiene.

Flue (Tier 2; v0.6.0--v0.8.0) shipped the agents-vs-workflows category split (persistent agents/ via createAgent vs finite workflows/ via run), local() sandbox factory with env allowlist, Cloudflare Shell sandbox replacing the previously misleading R2 model, run observability with bare runId routes, an OpenAPI sub-app, and a read-only admin sub-app. The runs-as-workflow- only choice is the cleanest "what is the receipt?" answer this cycle.

What To Try

Codex operators: point goal mode at an objective spanning hours or days on 26.519 + CLI 0.133.0; observe the dedicated storage and progress-tracking surface. If you have multiple teams, draft a base permission profile and derive per-team variations using the new inheritance.
Claude Code operators: audit managed settings before upgrade to 2.1.152 if you relied on the Auto mode consent dialog as a manual posture check. Skill authors should evaluate disallowed-tools.
OpenHands evaluators: enable ENABLE_ACP and point it at Claude Code, Codex, or Gemini CLI as the back-end. Observe how LLM/Condenser/MCP settings grey out — authority shifts to the back-end agent and the UI reflects the transfer.
Agent Zero operators (host adopters): enable computer_use_remote on a non-critical host. Test the vision-verification stop flow: trigger a state change, withhold a screenshot, observe whether the agent halts as the release notes describe.
Hermes adopters: try pip install hermes-agent and route Codex CLI, Aider, Cline, or Continue through hermes proxy against a single OAuth subscription. Confirm the proxy's bind address before exposing it.
OpenClaw operators: verify your gateway.auth.rateLimit setting; the unset case is now ratelimited by default. Test the pre-dispatch allowFrom allowlist with a sender outside your trust set.

What Remains Uncertain

Codex managed requirements.toml distribution and signing: the release notes describe org-level enforcement but not how the file reaches the runtime, whether it is signed, or whether tampering is detectable. Enterprise adopters cannot rely on enforcement without this answer.
Gemini PolicyEngine-in-ACP default posture: per-session enforcement by default, or only when an operator has configured a policy? Release notes frame it as a deadlock fix. The structural shift implied by the change is larger than that framing suggests.
Agent Zero ephemeral-capture audit evidence: where does host-action evidence land for audit when screenshots are ephemeral? Operators cannot browse on-disk caches to confirm what the agent saw.
Hermes hermes proxy bind and auth model: PR body does not detail loopback-only binding or shared-token requirement. Default-loopback is the safe assumption to verify, not assume.
Gemini remote session invocation target: the protocol is stable but where remote invocations actually run (Google-hosted, operator-hosted, both) is undocumented.
OpenHands no-tagged-release operators: the strategic positioning, the org-LLM-profile feature, and the cross-org credential leak fix are all main-branch-only. Operators tracking the 1.x release channel see none of this until the next release consolidates.
The composition pattern: OpenHands ACP UI fronting Claude Code, Codex, or Gemini CLI is a multi-product composition claim that does not fit the current finding schema's single-subject assumption. Paperclip's ACPX-Claude adapter respecting ~/.claude/settings.json is the same shape. This is a schema doctrine question recorded in the audit note for this digest.
Two weeks of Claude Code What's-New digests not yet published (Weeks 21--22). The official_digest priority-1 surface in sources/claude-code.yml is missing this fortnight. Harvesters running this window must fall through to the changelog only.

Governance Becomes Enforcement

Tue, 12 May 2026 00:00:00 GMT

Governance Becomes Enforcement

Five days, nine providers. The change that cuts across all of them is deceptively simple: governance is moving from convention to enforcement.

The older model was: the agent could do X, and operators relied on prompting, documentation, and trust to prevent the wrong X. The new model -- visible in at least four independent places this week -- is: the wrong X is structurally blocked, logged, or defaults to off.

Hermes made secret redaction the default, not an opt-in. Paperclip blocked agents from self-transitioning to review without a real review path. OpenHands defaulted sub-agent delegation to off and surfaced evaluation scores in the UI. Agent Zero defaulted document output to open formats and told agents to prefer named actions over coordinate clicks. These are different tools, different teams, and different architectures. The pattern is the same: risky behavior requires explicit enablement; safe behavior is what happens by default.

The other half of the week was about durability: agents that can stay on task across turns, sessions, crashes, and context compression. Claude Code shipped a full claude agents supervisor surface and a /goal command. Hermes shipped the same /goal primitive and backed it with a Kanban board that enforces completion evidence before marking work done. Gemini made sessions portable across machines. Agent Zero made desktop sessions persistent across navigation.

These two themes -- governance as enforcement, long-horizon durability -- are not coincidental. You need both. Durability without governance means persistent agents doing the wrong thing persistently. Governance without durability means agents that are safe but cannot hold a goal long enough to finish anything.

Breaking Changes: Check These Before Upgrading

Hermes v0.13.0: secret redaction is now ON by default. If you have Hermes log pipelines that read raw agent output, they will receive sanitized logs after upgrade. This is the right call as a default; it is a breaking change for tooling that depends on unredacted output.

Paperclip v2026.512.0: SSH host environment was leaking. Before PR #5142, SSH remote execution forwarded the Paperclip host's environment variables -- including API keys and tokens -- to remote execution targets. Operators running SSH-managed agents should treat this as a security advisory and upgrade.

Hermes v0.13.0: Discord role allowlists are now guild-scoped. The prior behavior allowed a role match from any guild to authorize a cross-guild DM -- a CVSS 8.1 bypass. Discord operators using role-based access control should reverify their configuration.

Claude Code v2.1.x: worktree.baseRef now defaults to "fresh". New worktrees now branch from origin/<default> rather than the local HEAD. Operators who depended on new worktrees carrying unpushed local commits should set worktree.baseRef: "head" explicitly.

Pi v0.74.0: package scope migration underway. The npm package is moving from @mariozechner/pi-coding-agent to @earendil-works/pi-coding-agent. Global installs: run pi update --self once the new package publishes. CI, Dockerfiles, and package.json pins: update the reference manually.

Evidence Before Completion

Two providers shipped independent enforcement of the same principle this week: agents cannot self-attest that work is complete.

Hermes's Kanban board now requires workers to have valid card references before a task card moves to done. The hallucination gate verifies that cards a worker claims to have created actually exist and belong to that worker -- blocking phantom references and cross-worker card claims. Workers that exit without completing are auto-blocked. Heartbeats detect stale workers; zombie processes are detected on both platforms. Per-task retry budgets prevent silent cascades.

Paperclip's control-plane fix (PR #5292) blocks agents from self-transitioning an issue to in_review state. The in_review transition now requires a real review precondition, not just a model deciding it is ready for review.

These are different mechanisms -- Hermes's is multi-agent coordination, Paperclip's is a state-machine gate -- but the observation is the same: agent claims about their own completion are not sufficient evidence of completion. The system needs to verify independently.

For operators building multi-agent workflows: the completion contract is now part of the orchestration contract, not just the prompt.

Long-Horizon Durability

The week's most operator-visible features are all about agents staying on task.

Claude Code's claude agents supervisor view shows every session by state -- working, waiting on you, done, failed -- with background sessions running under a persistent supervisor process that survives terminal close. Sessions isolate to separate git worktrees automatically. You can dispatch from the prompt, background an active session with one keystroke, and reply to blocked sessions from a peek panel without attaching. Alongside it, the /goal command sets a completion condition that Claude tracks across turns until met.

Hermes's /goal Ralph loop does the same thing at the session level, backed by the Kanban reliability primitives above. Lock the agent onto a target and it persists across context compression, turn budgets, and branching. The Kanban layer handles the multi-agent case: workers pick up tasks, execute them, and cannot mark themselves done without evidence.

Gemini CLI's session export/import makes sessions portable: export a session, move it to another machine, import and continue. State crosses as a serializable object rather than ambient context.

Agent Zero's persistent desktop lifecycle (v1.13) changes the semantics of the desktop environment: a single Xpra XFCE session stays alive across canvas navigation, modal switches, and keepalive hosts. Explicit shutdown is distinguished from crashes; unsafe affordances are hidden. The desktop is now a persistent surface, not one that resets on navigation.

Authority Made Visible

Three separate tools shipped changes this week that make permission state observable at a glance.

Codex's TUI now shows permissions and approval-mode as separately configurable status-line items. The most common operator surprise before this -- forgetting which permission posture is active before an irreversible command -- is now a visual check.

Claude Code's claude agents supervisor makes session state visible: working, waiting, done, or failed. A single panel replaces five terminal windows. The live overlay on /goal tracks elapsed time, turns, and tokens consumed.

OpenHands's new critic evaluation display shows a score (0--1), star rating (0--5), and color-coded bands in the GUI for every completed session: agent_behavioral_issues, user_followup_patterns, and infrastructure_issues. The display is deployment-controlled via OH_ENABLE_CRITIC_BY_DEFAULT (disabled by default). When enabled, it creates a feedback loop that doesn't exist when evaluation lives in logs: users see when sessions are degrading in real time.

Agent Zero's Linux Desktop skill takes this in a different direction: it tells the agent to use named structured actions (cell_edit, app_launch, form_submit) and treat coordinate clicks (click(x=423, y=187)) as a last resort. The principle is audit clarity. cell_edit(B3, 42) is meaningful; a coordinate click is not. An action that can be named and described is easier to verify, replay, and record than one that can only be described by its position.

Default-Closed Governance

The week also continued a trend across the watchlist: sensitive capabilities default to closed, and operators must explicitly enable them.

OpenHands's sub-agent delegation (enable_sub_agents) defaults to off. Behind the gate, the orchestrator routes tasks to specialized sub-agents -- a bash runner, a code explorer, a web researcher -- each with tool surfaces defined by TaskToolSet rather than full access. The default-off choice is right: routing work to specialized agents changes session scope, cost, and authority in ways that require deliberate operator decision.

OpenClaw's skill archive upload gate (skills.install.allowUploadedArchives) defaults to closed. Trusted Gateway clients can stage and install zip-backed skills only when the operator explicitly enables the flag. OpenClaw keeps repeating this pattern: code-execution surfaces are opt-in, explicit, and documented as requiring trust.

Agent Zero's ODF-first document default (v1.13) inverts the prior assumption: document artifacts now default to ODT/ODS/ODP (open formats) rather than DOCX/XLSX/PPTX. OOXML compatibility requires explicit opt-in. For operators with downstream workflows expecting Office XML output, this is a change to verify before upgrading.

Provider Notes

Claude Code (v2.1.139) adds settings.autoMode.hard_deny: hard blocks that no allow rule can override. The continueOnBlock option for PostToolUse hooks feeds the rejection reason back so Claude can adapt rather than just stop. API key auth now disables Remote Control, /schedule, and claude.ai MCP connectors -- operators using API key auth should audit reliance on those surfaces.

OpenClaw (v2026.5.10 beta) adds per-agent message send restrictions (tools.message.crossContext, tools.message.actions.allow) that let you deploy a sandboxed agent that can only reply in the thread it was addressed in. Memory auto-promotion is now bounded: the dreaming process compacts the oldest sections when the budget is reached, while preserving user-authored notes. Transcript reads are now streaming; peak memory for a long session dropped roughly 90%.

Paperclip (v2026.512.0) adds secrets provider vault configuration with AWS Secrets Manager as the first remote-import backend. The database gains secret_access_events and company_secret_provider_configs tables. The new cursor_cloud adapter routes work to Cursor's hosted-agent platform.

Agent Zero (v1.11--v1.13) completes what it calls the "visible computer": browser with multi-tab parallel fanout, LibreOffice desktop via Xpra/XFCE, and a persistent desktop session. The multi browser action fans out reads or mutations across tabs in a single tool call with parallel execution.

Gemini CLI (v0.41.0) adds a pluggable AgentProtocol with local and remote backends, forcing the "where does delegated work actually run" question into a surface that can be inspected and configured. Workspace trust now enforces in headless mode; shell command validation gains a core-tools allowlist.

Pi coding agent (v0.74.0) migrates from badlogic/pi-mono to the Earendil Works organization. JSONC parsing for models.json is new (comments and trailing commas now valid).

What To Try

Hermes operators: verify your log pipeline handles sanitized output before upgrading to v0.13.0. Redaction is now on.
Paperclip operators running SSH: upgrade before deploying new remote agents. The host env isolation fix is silent in prior versions.
Claude Code: dispatch a background session with claude --bg "<prompt>", use claude agents to monitor, and test peek/reply from the list. Set a /goal on a multi-step task and inspect the turn/token overlay.
OpenHands: enable enable_sub_agents in a multi-task session. Observe whether sub-agent scoping reduces total session cost or context accumulation.
Agent Zero: create a Writer document and confirm the output is ODT (not DOCX) in v1.13+. Verify your downstream tooling handles ODT, or explicitly configure OOXML output.
Codex: add both permissions and approval-mode to your status line if you run multiple permission profiles.

What Remains Uncertain

Hermes Kanban hallucination gate: what does verification involve? Is it model-based, schema-based, or rule-based? The gate's false-positive rate under real multi-agent workloads is not yet documented.
Paperclip in_review gate: what constitutes a "real review path"? The PR notes do not define whether a human reviewer, an automated review step, or a configured participant list is required.
OpenHands critic calibration: what does a score of 0.4 mean operationally? When does agent_behavioral_issues fire versus user_followup_patterns? The calibration methodology is not yet documented.
Gemini RemoteSubagentProtocol: ships with tests but no observed remote target. Whether the remote execution surface runs on a Google-hosted infrastructure or a user-controlled one is not yet established.
Claude Code /ultrareview: the research preview returns verdicts to CLI/Desktop but the output schema is not documented. How should a CI pipeline ingest or route the findings?
Agent Zero desktop state: is there a session timeout, an idle cleanup, or a storage limit for persistent Xpra sessions? Or does the operator manage cleanup entirely manually?
OpenClaw skill archive trust model: skills.install.allowUploadedArchives is opt-in, but signature checking and sandbox isolation for uploaded archives are not yet documented.

The Harness Leaves The Chat Box

Thu, 07 May 2026 00:00:00 GMT

The Harness Leaves The Chat Box

The last two weeks of commits make one thing clear: the interesting action in coding agents is no longer confined to the model or the chat transcript.

Agent harnesses are becoming operating surfaces.

Codex is adding persistent goals, session metadata, memory plumbing, plugin controls, sandbox work, and cloud executor paths. Gemini CLI is treating memory as a reviewable patch, with workspace trust, approval modes, shell safety, and structured non-interactive output close behind. Hermes is sanding down the rough edges of persistent personal agents: gateways, systemd, voice, themes, model providers, skills, search, kanban, and memory scoping. Pi keeps proving the opposite design lesson: a thin harness can move quickly because integrations can be added, removed, or rewritten without becoming the whole product.

The expanded watchlist changes the story. OpenClaw shows that accessibility is not a side quest; ordinary surfaces like Discord, Telegram, WhatsApp, OAuth, voice, onboarding, and visible progress are where agents become usable. Agent Zero shows the workcell becoming literal: browser, desktop, documents, file browser, screenshots, OAuth, and time-travel state. Paperclip shows the company/control-plane version of the problem: remote provisioning, sandbox providers, cost summaries, roles, liveness, pause/resume, and stale session recovery. OpenHands shows what happens when a harness becomes a platform: app server, model profiles, MCP proxying, secrets, security redaction, self-hosted integrations, sandbox grouping, and old runtime cleanup.

The frontier is not one winning agent. The frontier is the environment around agents getting thicker.

The Week In One Sentence

Coding agents are gaining goals, memory, computers, permissions, gateways, integrations, and supervision layers; the durable question is who owns the loop around all of that.

Main Signals

1. Persistent Agent State Is Becoming A Product Surface

The strongest single signal is still Codex /goal. It is not just a UX affordance. The goal validation work shows that persistent objectives now deserve first-class validation, paste handling, queued-command behavior, and user guidance.

Gemini's Auto Memory inbox points in the same direction from another angle: memory should be proposed, reviewed, and accepted, not silently smeared into hidden context. Hermes adds memory scoping and Curator commands. OpenClaw is making agent progress visible in chat with timeline spans.

This is a real shift. Agent-side state is becoming more durable, more visible, and more operational.

Builder question:

What goal, memory, session, recap, skill report, or thread state shaped this run?

2. The Agent Interface Is Becoming A Visible Computer

Agent Zero is the clearest evidence. It replaced a browser-use agent with a native browser, then added a Chromium runtime, browser tabs, screenshot previews, annotation, file browser search, ZIP downloads, Linux desktop controls, document canvas, LibreOffice runtime, and OAuth/quota visibility.

OpenHands is moving in the same broad direction from the platform side with sandbox grouping, app-server routing, ACP/MCP surfaces, user secrets, model profiles, and enterprise integrations. Paperclip adds remote provisioning and sandbox provider work. Codex is adding cloud executor paths and sandbox hardening.

The chat box is not enough. Serious agent work wants a visible machine.

Builder question:

Can I see the browser, files, runtime, screenshots, credentials, and artifacts that shaped this work?

3. Permissions, Secrets, And Sandboxes Are Moving Into The Foreground

This window is full of authority work. Codex has permission profiles, sandbox profiles, plugin sharing controls, MCP metadata, and Linux sandbox hardening. Gemini has workspace trust, private memory patch allowlists, shell safety evals, approval-mode-aware subagents, and policy-engine work. OpenHands tightened redaction and removed a secret log. OpenClaw is fixing allowlists, subagent security docs, OAuth labels, and live exec output limits. Paperclip is adding security roles and sandbox provider contracts. Agent Zero keeps browser and office surfaces opt-in and exposes OAuth disconnect and quota visibility.

This is the right direction. The harness is starting to show its authority model.

Builder question:

What could this agent read, change, execute, install, send, or leak?

4. Accessibility Is A Frontier Capability

OpenClaw is the necessary corrective to an overly technical reading of the market. Its commits are full of work that makes agents usable by normal people: setup recovery, stale plugin repair, Discord voice behavior, Telegram reactions, WhatsApp identity mapping, OAuth labels, progress previews, chat drafts, typography cleanup, install recovery, and group allowlists.

Hermes is doing adjacent work through setup fixes, voice push-to-talk parity, dashboard themes, gateway restart readiness, provider pickers, and messaging surfaces. Agent Zero is making the computer visible with screenshot previews. Pi is improving login, terminal rendering, compact resource reads, clipboard behavior, and quickstart docs. Gemini is making memory reviewable and headless auth more reliable. OpenHands is exposing model names and model switching in the UI.

That matters. Accessibility is not softness. It is distribution, trust, and operator leverage.

Builder question:

Can a real person start, understand, recover, and control this thing without learning the project owner's private ontology?

5. Agent Systems Are Growing Control Planes

Paperclip makes the control-plane problem explicit. It is working on runtime specs, sandbox providers, cost summaries, roles, liveness, stale sessions, issue workflows, ordered sub-issues, pause/resume controls, and remote workspace shaping.

OpenHands is consolidating around the app server. Hermes has kanban task runners, gateway lifecycle, Curator, providers, and dashboard state. Codex is moving skills, goals, sessions, plugins, and executors into app-server-shaped surfaces. OpenClaw is handling gateway sessions, subagents, plugin metadata, and live execution timelines.

This is the factory problem in miniature.

Builder question:

When agents coordinate across tasks and machines, what keeps the system legible?

6. Integrations Are Volatile; The Operating Loop Has To Be Durable

Pi added providers, removed providers, changed Codex transport, added auth flows, improved session behavior, and kept terminal output evolving. Hermes is moving model providers into plugins. OpenClaw is externalizing channel plugins. OpenHands is replacing config surfaces and moving toward app-server services. Codex and Gemini are evolving plugin, MCP, memory, and approval surfaces quickly.

This is not a warning against using frontier tools. It is the reason to use them through a durable loop.

Builder question:

What should remain stable while the best agent, provider, runtime, protocol, or plugin changes every week?

What Serious Builders Should Try

Test persistent goals, but write down what owns the project-level objective before you trust the agent's local goal.
Prefer memory systems that show proposed changes before accepting them.
Try at least one visible-computer harness. The browser, file system, screenshots, and desktop surface reveal different failure modes than terminal chat.
Inspect the permissions and sandbox story before giving an agent real credentials.
Treat messaging and voice surfaces as product lessons, not consumer fluff.
Track exact harness version, provider, transport, plugin set, sandbox, and credential path for serious runs.

What Remains Uncertain

OpenClaw's high commit volume makes it hard to separate durable product movement from rapid stabilization without deeper release-note and diff review.
This run is commit-harvest focused. Claude Code was excluded because the v0 source contract does not define a public commit stream.
Commit metadata was broad-sampled across all projects, but only selected high-signal commits received diff-level review.
The frontier may be converging on visible computers, but the winning shape is still open: local desktop, browser sandbox, remote workcell, hosted app server, messaging agent, or some combination.
It is unclear which agent-side memories and goals will remain stable enough to integrate deeply versus merely record as tool-local state.

Coding Agents Are Becoming Working Environments

Wed, 06 May 2026 00:00:00 GMT

Coding Agents Are Becoming Working Environments

The last two weeks were not about one coding agent pulling ahead. They were about the layer around coding agents getting more serious.

By harness, I mean the practical wrapper around the model: the CLI, permission system, memory surface, sandbox, plugin layer, review flow, and runtime assumptions that determine what the agent can actually do.

Codex added persistent goals. Claude Code pushed deeper into cloud review, session recaps, plugins, hooks, MCP, and telemetry. Gemini CLI tightened workspace trust and environment loading while experimenting with reviewable memory patches. Hermes added a background Curator for skill maintenance. Pi kept proving the other side of the market: a small terminal harness can move quickly by keeping the core thin.

The through line is simple: coding agents are becoming less like chat boxes and more like working environments.

That is useful, but it also raises the stakes. If the agent can remember, review, load plugins, carry goals, and run under different permission modes, then serious developers need to know which of those surfaces shaped the work.

The Signals

Persistent goals move coding agents beyond single sessions

Codex /goal is the strongest signal in this window. It gives the agent something more durable than a prompt: a persistent objective it can carry across a longer arc of work.

That matters because long-horizon development is not just a code-generation problem. It is an orientation problem. The agent has to stay pointed in the right direction across sessions, reviews, interruptions, and course corrections.

The new question is not "can the agent remember?" It is "what goal is it pursuing, and who decided that goal is still the right one?"

For Bitter, the answer should be conservative: use agent goals, but record which goal was active and reconcile it against the project charter and current task before treating it as durable project memory.

Supported by:

Codex

Agent memory is becoming a product surface

Claude session recaps, Gemini Auto Memory, and Hermes Curator all point in the same direction: agent tools are learning how to carry context forward.

That is good. It also means memory is no longer one thing. A serious run may now be shaped by chat history, session recaps, generated memory patches, skill reports, resume state, and local project notes.

Bitter should not fight those surfaces. It should record which agent-side memory affected a run, then decide what deserves to become part of the project's durable record.

Supported by:

Permissions are getting clearer, but every agent does them differently

Codex expanded permission profiles and sandbox metadata. Gemini added secure .env loading, workspace trust, and shell allowlists. Claude Code kept moving around plugins, hooks, MCP, telemetry, and permission prompts. Pi's provider and extension layers changed quickly.

The direction is good: agents are exposing more of the authority they run with. The problem is fragmentation. Every tool names and scopes that authority in its own way.

For serious work, "the agent had access" is not enough. The useful question is more specific: what could it read, what could it change, which plugins were enabled, which credentials were exposed, which sandbox was active, and which release channel was running?

Supported by:

Review is moving inside the agent tools

Claude /ultrareview is the cleanest example: provider-native cloud fleets can review branches and PRs. Codex multi-agent controls, Gemini subagent and eval work, and Hermes Curator reports rhyme with it.

This is a useful direction. Agent tools should be able to criticize their own work. But native review is still evidence, not truth.

A review surface can produce a useful claim: "this looks risky," "this path failed," "this patch needs another pass." The project still needs an external standard for what counts as done.

Supported by:

Plugins and skills are becoming the new agent interface

Codex plugins, Claude plugins, Gemini extensions and MCP, Hermes skills, and Pi extension APIs are all part of the same shift. The practical power of an agent is moving into the things around it.

That makes the harness more useful, but also harder to reason about. If a run depends on a plugin, extension, hook, skill, or transport layer, that surface is part of the work environment and should be visible in the record.

Supported by:

Do not build your workflow around one agent's current integration list

Pi removed built-in Gemini CLI and Antigravity support while adding many new providers. Gemini's stable, preview, and nightly channels differ materially. Codex alpha and app-server surfaces move quickly.

This is normal frontier motion. The mistake is treating any current integration list as durable architecture.

The stable layer should be the project workflow around the agent: objective, permissions, execution environment, evidence, review, memory, and what the next run should know.

Supported by:

What Serious Developers Should Do

Treat persistent goals as useful, but make sure they still match the project-level direction.
Treat agent-side memory as context, not automatically as the project record.
Record which goals, recaps, memories, plugins, skills, permission modes, release channels, and transports were active during serious runs.
Prefer tools that make trust, sandboxing, plugins, sessions, and review state easy to inspect.
Treat native agent review as evidence, not final judgment.

What Bitter Is Testing

How Codex /goal changes long-running work in a real repo.
How to record agent memory, permissions, plugins, review output, and release channels without tying Bitter to one tool's vocabulary.
Whether Claude /ultrareview, Gemini memory patches, Hermes Curator, and Pi extension metadata produce evidence worth carrying forward.
Which agent harnesses expose enough state to be trustworthy over long runs.
How to keep the public research loop conservative: no signal unless it can change what someone does next.

What Remains Uncertain

Whether persistent goals become stable enough for long-horizon development or remain convenience features tied to one tool.
Whether agent memory surfaces converge, or each product keeps inventing its own private memory layer.
Whether cloud/native review produces evidence that is inspectable enough for serious work.
Whether plugin and skill ecosystems converge around useful metadata.
Which agent tools expose enough permission, session, plugin, transport, and release-channel state to support trustworthy wrapping.