Research Version

The Harness Leaves The Chat Box

2026-05-07-commit-harvest-2026-04-23_2026-05-07-frontier-v1

Status: published
Window: 2026-04-23 to 2026-05-07
Signals: 6

Mode: manual_commit_harvest

Run digest

The last two weeks of commits make one thing clear: the interesting action in coding agents is no longer confined to the model or the chat transcript.

Agent harnesses are becoming operating surfaces.

Codex is adding persistent goals, session metadata, memory plumbing, plugin controls, sandbox work, and cloud executor paths. Gemini CLI is treating memory as a reviewable patch, with workspace trust, approval modes, shell safety, and structured non-interactive output close behind. Hermes is sanding down the rough edges of persistent personal agents: gateways, systemd, voice, themes, model providers, skills, search, kanban, and memory scoping. Pi keeps proving the opposite design lesson: a thin harness can move quickly because integrations can be added, removed, or rewritten without becoming the whole product.

The expanded watchlist changes the story. OpenClaw shows that accessibility is not a side quest; ordinary surfaces like Discord, Telegram, WhatsApp, OAuth, voice, onboarding, and visible progress are where agents become usable. Agent Zero shows the workcell becoming literal: browser, desktop, documents, file browser, screenshots, OAuth, and time-travel state. Paperclip shows the company/control-plane version of the problem: remote provisioning, sandbox providers, cost summaries, roles, liveness, pause/resume, and stale session recovery. OpenHands shows what happens when a harness becomes a platform: app server, model profiles, MCP proxying, secrets, security redaction, self-hosted integrations, sandbox grouping, and old runtime cleanup.

The frontier is not one winning agent. The frontier is the environment around agents getting thicker.

That is useful pressure for operators. The stronger the agent tools get, the more valuable the developer's own loop becomes.

The Window In One Sentence

Coding agents are gaining goals, memory, computers, permissions, gateways, integrations, and supervision layers; the durable question is who owns the loop around all of that.

Main Signals

1. Persistent Agent State Is Becoming A Product Surface

The strongest single signal is still Codex /goal. It is not just a UX affordance. The diff-reviewed commit around goal validation shows persistent objectives now receiving first-class validation, paste handling, queued-command behavior, and user guidance.

Gemini's Auto Memory inbox points in the same direction from another angle: memory should be proposed, reviewed, and accepted, not silently smeared into hidden context. Hermes adds memory scoping and Curator commands. OpenClaw is dealing with memory wiki details, task reload blockers, gateway session files, and visible tool progress in chat channels.

This is a real shift. Agent-side state is becoming more durable, more visible, and more operational.

The operator question:

What goal, memory, session, recap, skill report, or thread state shaped this run?

The durable answer:

Use agent-side state, but record it as agent-side state. The agent may carry a goal. The developer owns the charter.

Supported by Codex, Gemini CLI, Hermes, and OpenClaw.
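
One way to picture the propose, review, accept flow described above is a small inbox where nothing becomes durable memory until a person accepts it. The sketch below is illustrative only: `MemoryProposal`, `MemoryInbox`, and every field name are hypothetical, and it does not reflect the data model of Gemini CLI or any other harness.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class MemoryProposal:
    # Hypothetical shape, invented for this example.
    key: str
    value: str
    source: str  # which harness proposed it, recorded as agent-side state
    status: Status = Status.PROPOSED


@dataclass
class MemoryInbox:
    proposals: list[MemoryProposal] = field(default_factory=list)
    accepted: dict[str, str] = field(default_factory=dict)

    def propose(self, key: str, value: str, source: str) -> MemoryProposal:
        # Proposals are stored, but they do not touch accepted memory yet.
        proposal = MemoryProposal(key, value, source)
        self.proposals.append(proposal)
        return proposal

    def review(self, proposal: MemoryProposal, accept: bool) -> None:
        if proposal.status is not Status.PROPOSED:
            raise ValueError("proposal already reviewed")
        if accept:
            proposal.status = Status.ACCEPTED
            self.accepted[proposal.key] = proposal.value
        else:
            proposal.status = Status.REJECTED

    def pending(self) -> list[MemoryProposal]:
        return [p for p in self.proposals if p.status is Status.PROPOSED]
```

The design choice worth noticing is that `accepted` changes only inside `review`, so the operator's decision is the single path into durable memory.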

2. The Agent Interface Is Becoming A Visible Computer

Agent Zero is the clearest evidence. It replaced a browser-use agent with a native Playwright-powered browser and added persistent Chromium runtime work, browser tabs, screenshot previews, annotation, file browser search, ZIP downloads, Linux desktop controls, a document canvas, a LibreOffice runtime, and OAuth/quota visibility.

OpenHands is moving in the same broad direction from the platform side: sandbox grouping UI, app-server routing, ACP/MCP surfaces, user secrets, model profiles, and enterprise integrations. Paperclip adds remote runtime provisioning and sandbox provider work. Codex is adding cloud executor paths and sandbox hardening.

The chat box is not enough. Serious agent work wants a visible machine.

The operator question:

Can I see the browser, files, runtime, screenshots, credentials, and artifacts that shaped this work?

The durable answer:

Workcells should be leased, bounded, visible, resumable, and evidence-bearing. A workcell is not just a place to run commands. It is where agent labor becomes inspectable software work.

Supported by Agent Zero browser, Chromium runtime, OpenHands, and Paperclip.
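
The five workcell properties named above (leased, bounded, visible, resumable, evidence-bearing) can be sketched as a small record. This is a minimal illustration under stated assumptions: `Workcell` and its fields are hypothetical names, and the path check is lexical only, so a real boundary would still need resolved paths and OS-level enforcement.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath


@dataclass
class Workcell:
    # All names here are hypothetical, for illustration only.
    owner: str
    root: str                # bounded: work is expected to stay under this path
    lease_expires: datetime  # leased: authority ends at this time
    evidence: list[str] = field(default_factory=list)  # evidence-bearing and visible

    def is_active(self, now: datetime) -> bool:
        return now < self.lease_expires

    def in_bounds(self, path: str) -> bool:
        # Lexical prefix check only; it does not resolve symlinks or "..".
        return PurePosixPath(path).is_relative_to(self.root)

    def record(self, now: datetime, path: str) -> None:
        # Refuse to log work done after the lease ended or outside the root.
        if not self.is_active(now):
            raise PermissionError("lease expired")
        if not self.in_bounds(path):
            raise PermissionError(f"{path} is outside {self.root}")
        self.evidence.append(f"{now.isoformat()} {path}")

    def renew(self, now: datetime, minutes: int) -> None:
        # resumable: a later session extends the lease and keeps the evidence log
        self.lease_expires = now + timedelta(minutes=minutes)


start = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
cell = Workcell("operator", "/work/repo", start + timedelta(minutes=30))
cell.record(start, "/work/repo/shots/home.png")
```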

3. Permissions, Secrets, And Sandboxes Are Moving Into The Foreground

This window is full of authority work. Codex has permission profiles, sandbox profiles, plugin sharing controls, MCP metadata, and Linux sandbox hardening. Gemini has workspace trust, private memory patch allowlists, shell safety evals, approval-mode-aware subagents, and policy-engine work. OpenHands tightened log redaction and removed a debug log exposing hook config secrets. OpenClaw is fixing access group allowlists, subagent security docs, OAuth labels, and live exec output limits. Paperclip is adding security roles and sandbox provider contracts. Agent Zero keeps browser and office surfaces opt-in and exposes OAuth disconnect and quota visibility.

This is the right direction. The harness is starting to show its authority model.

The operator question:

What could this agent read, change, execute, install, send, or leak?

The durable answer:

Credential handling and permission profiles should model the real authority surface of each harness, not an idealized abstraction. The run record should include credentials, plugins, approval mode, sandbox, network posture, OAuth state, and any known secret-handling caveats.

Supported by OpenHands redaction, hook config, Gemini CLI, Codex, and OpenClaw.
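
The run-record fields listed above can be captured in a small structure. The sketch below is one possible shape, not a schema from any of the harnesses discussed: `AuthorityRecord`, its field names, and the convention of writing `"unknown"` for unverified settings are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class AuthorityRecord:
    # Hypothetical schema; field names are illustrative.
    harness: str
    approval_mode: str
    sandbox: str
    network: str
    credentials: tuple[str, ...] = ()  # credential names only, never secret values
    plugins: tuple[str, ...] = ()
    oauth_state: str = "none"
    caveats: tuple[str, ...] = ()      # known secret-handling caveats

    def gaps(self) -> list[str]:
        """Return the fields still marked 'unknown' so a reviewer can resolve them."""
        return [name for name, value in asdict(self).items() if value == "unknown"]

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = AuthorityRecord(
    harness="example-harness",
    approval_mode="ask-before-write",
    sandbox="unknown",
    network="unknown",
    credentials=("GITHUB_TOKEN",),
)
print(record.gaps())
```

Listing gaps explicitly keeps the record honest about the real authority surface: an unverified sandbox is reported as unknown rather than assumed safe.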

4. Accessibility Is A Frontier Capability

OpenClaw is the necessary corrective to an overly internal reading of the market. Its commits are full of work that makes agents usable by normal people: channel setup recovery, stale plugin repair, Discord voice behavior, Telegram reactions, WhatsApp identity mapping, OAuth labels, progress previews, chat drafts, typography cleanup, install recovery, and group allowlists.

Hermes is doing adjacent work through setup wizard fixes, voice push-to-talk parity, dashboard themes, gateway restart readiness, provider pickers, and messaging surfaces. Agent Zero is making the computer visible. Pi is improving login, terminal rendering, compact resource reads, clipboard behavior, and quickstart docs. Gemini is making memory reviewable and headless auth more reliable. OpenHands is exposing model names and model switching in the UI.

That matters. Accessibility is not softness. It is distribution, trust, and operator leverage.

The operator question:

Can a real person start, understand, recover, and control this thing without learning the project owner's private ontology?

The durable answer:

Internal doctrine can stay deep, but the public surface needs humane language and visible affordances. The lesson from OpenClaw is not to become casual. It is to make serious authority understandable.

Supported by OpenClaw setup, OAuth status, Hermes, Agent Zero, and Pi.

5. Agent Systems Are Growing Control Planes

Paperclip makes the control-plane problem explicit. It is working on remote provisioning, sandbox providers, cost summaries, roles, liveness, stale sessions, issue workflows, ordered sub-issues, pause/resume controls, and remote workspace shaping.

OpenHands is consolidating around app-server reality. Hermes has kanban task runners, gateway lifecycle, Curator, provider modules, and dashboard state. Codex is moving skills, goals, sessions, plugins, and executors into app-server-shaped surfaces. OpenClaw is handling gateway sessions, subagents, plugin metadata, and live execution timelines.

This is the agent-orchestration problem in miniature.

The operator question:

When agents coordinate across tasks and machines, what keeps the system legible?

The durable answer:

The orchestration layer should not become the agent. It should own the joined operating view and the run contract: charter, scope, agent, runtime, authority, cost, evidence, recovery, and next action.

Supported by Paperclip runtime specs, cost summaries, OpenHands, and Hermes.
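
The run contract named above has nine parts, and a simple legibility check is to ask which of them are still empty. The following is a minimal sketch: `RunContract` and `missing_fields` are hypothetical names, and representing each part as an optional string is a simplifying assumption.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunContract:
    # The nine parts of the run contract; names and types are illustrative.
    charter: Optional[str] = None
    scope: Optional[str] = None
    agent: Optional[str] = None
    runtime: Optional[str] = None
    authority: Optional[str] = None
    cost: Optional[str] = None
    evidence: Optional[str] = None
    recovery: Optional[str] = None
    next_action: Optional[str] = None


def missing_fields(contract: RunContract) -> list[str]:
    """List the contract parts that are unset or empty, in declaration order."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


draft = RunContract(charter="Ship the importer fix", agent="example-agent")
print(missing_fields(draft))
```

An orchestration layer built on this idea would hold the contract itself and refuse to dispatch a run while `missing_fields` is non-empty, which keeps it distinct from the agent doing the work.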

6. Integrations Are Volatile; The Operating Loop Has To Be Durable

Pi added providers, removed providers, changed Codex transports, added auth flows, improved session behavior, and kept terminal output evolving. Hermes is moving model providers into plugins. OpenClaw is externalizing channel plugins. OpenHands is replacing config surfaces and moving toward app-server services. Codex and Gemini are evolving plugin, MCP, memory, and approval surfaces quickly.

This is not a warning against using frontier tools. It is the reason to use them through a durable loop.

The operator question:

What should remain stable while the best agent, provider, runtime, protocol, or plugin changes every week?

The durable answer:

The agent can change. The charter, authority, evidence, verification, memory, and next run should compound.

Supported by Pi removals, Codex transport, Hermes providers, and OpenClaw plugins.

What Serious Builders Should Try

Test persistent goals, but write down what owns the project-level objective before you trust the agent's local goal.
Prefer memory systems that show proposed changes before accepting them.
Try at least one visible-computer harness. The browser, file system, screenshots, and desktop surface reveal different failure modes than terminal chat.
Inspect the permissions and sandbox story before giving an agent real credentials.
Treat messaging and voice surfaces as product lessons, not consumer fluff.
Track exact harness version, provider, transport, plugin set, sandbox, and credential path for serious runs.
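
For the last item above, one lightweight approach is to fingerprint the environment of each serious run and diff it against the previous one. This is a sketch under assumptions: the keys and values in `before` are invented examples, and `fingerprint` and `drift` are hypothetical helper names.

```python
import hashlib
import json


def fingerprint(env: dict) -> str:
    """Stable short hash of a run environment, independent of key order."""
    canonical = json.dumps(env, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def drift(before: dict, after: dict) -> dict:
    """Map each changed key to its (old, new) pair; absent keys show as None."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


# Example values are made up for illustration.
before = {
    "harness_version": "1.2.3",
    "provider": "example-provider",
    "transport": "http",
    "plugins": ["search"],
    "sandbox": "container",
    "credential_path": "env:EXAMPLE_TOKEN",
}
after = dict(before, transport="websocket")

print(fingerprint(before) == fingerprint(after))
print(drift(before, after))
```

Storing the fingerprint alongside each run's receipts makes it cheap to notice when a result changed because the harness changed, not because the work did.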

What This Research Should Test Next

Codex /goal as a provider-native goal under an operator charter.
Gemini Auto Memory as a reviewable memory proposal source.
Agent Zero as a visible workcell.
Paperclip's adapter runtime command spec as a model for run provisioning contracts.
OpenHands secret/log/sandbox patterns against credential and workcell boundaries.
OpenClaw setup recovery and channel progress visibility as accessibility benchmarks.
Pi as a thin, replaceable agent adapter with exact provider and transport records.

What Remains Uncertain

OpenClaw's high commit volume makes it hard to separate durable product movement from rapid stabilization without deeper release-note and diff review.
This run focuses on harvested commits. Claude Code was excluded because the v0 source contract does not define a public commit stream for it.
Commit metadata was sampled broadly across all projects, but only selected high-signal commits received diff-level review.
The frontier may be converging on visible computers, but the winning shape is still open: local desktop, browser sandbox, remote workcell, hosted app server, messaging agent, or some combination.
It is unclear which agent-side memories and goals will remain stable enough to integrate deeply, and which should merely be recorded as tool-local state.

Receipts

Findings and signal records for this run are under:

runs/2026-05-07-commit-harvest-2026-04-23_2026-05-07-frontier-v1/

Sources

Primary links, including exact changelog lines when available.