Compile intent.
Command the fleet.

Ambient Agents turns a spec or work graph into a durable, budgeted, recallable multi-agent campaign. Every agent is a ship. Every ship sails under sealed orders. Every mission survives the death of the machine that launched it.

UNDER ACTIVE DEVELOPMENT
PUBLIC ALPHA · PILOT SIGNUPS OPEN
KILL CRITERION · LIVE DOGFOOD BY 2027-01-02

The chart key

Every metaphor on this page maps to a real mechanism.

Nothing here is decoration. If a ship does something on this page, the system does it in production.

Ship
One agent: a Codex CLI, Claude Code, or gsd-pi worker doing a single node of work.
Carrier
The compiled orchestrator. It launches ships along the plan and never sleeps.
Flight plan
Your spec or work graph, compiled into a task-specific Temporal workflow.
Sealed orders
The intent contract: goal, invariants, falsification criteria, budgets, authority.
Fuel
Budgets: cost, steps, wall clock. Warn at 80 percent, hard halt at 100.
Scout run
A long research call (a Parallel.ai deep-research task) that takes real hours.
Flight recorder
Durable execution history. Any new carrier resumes the exact mission.
Harbormaster
The authority boundary. Ships propose; only the committer writes the charts.
War council
The decision arena: red and blue argue against pre-committed criteria.
Recall flare
The lineage kill switch. One command cancels an entire tree of work.
Act I · Lone ships

Brilliant pilots. No fleet.

Today's coding agents are superb single-seat pilots. Left alone in open water they wander, burn fuel, and occasionally turn for the one station you cannot afford to lose.

This is not hypothetical. In April 2026 a Cursor agent deleted a production database at PocketOS. A Replit agent did the same during a code freeze in July 2025. The write-up that followed said it plainly: system prompts are not security controls. Gartner forecasts that over 40 percent of agentic AI projects will be canceled by end of 2027 on cost, unclear value, or inadequate risk controls.

[Diagram: lone ships in open water; one rogue ship, authorized but ungoverned, turns toward PROD DB]

  • no recall
  • no budget that fires
  • no shared mission
  • no audit trail
  • the glue holding them together: unversioned, unreviewed, different every time
Act II · The flight plan

You don't fly the ships. You compile the mission.

Hand a spec or a work graph to the compiler and it emits the whole campaign: a task-specific durable workflow, with governance stamped into every node. Zero hand-written orchestration code.

Watch the star map. The carrier launches a ship per ready waypoint. Where the plan fans out, the carrier spins up more ships in parallel. Dispatch order is deterministic: priority first, then the critical path, then id, so "why did B run before C" is always answerable from the artifact. And before anything launches, the hygiene gate runs: dependency cycles, semantic duplicates, and runaway discovery depth are quarantined at compile time, loudly.

[Diagram: SPEC / WORK GRAPH (beads graph · markdown) → occ compile (cycle or duplicate found: quarantined at compile, loudly) → launch → survey → scaffold → phase boundary: drift check → build ×3 (fan-out) → verify → promote]

  • zero hand-written workflow code
  • deterministic dispatch: priority, then critical path, then id
  • bounded parallelism: max_parallel caps live ships
  • the plan is a reviewable, versioned artifact
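
The dispatch rule above can be sketched as a sort key plus a parallelism cap. This is a minimal illustration, not the compiler's actual code: `Waypoint`, `dispatch_key`, and `next_batch` are hypothetical names, and the sketch assumes that a lower priority number is more urgent and that a longer remaining critical path goes first, neither of which the text above specifies.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    id: str
    priority: int           # assumption: lower number = more urgent
    critical_path_len: int  # assumption: longest remaining chain through this node

def dispatch_key(w: Waypoint) -> tuple:
    # Priority first, then critical path (longer first), then id as a stable tiebreak.
    return (w.priority, -w.critical_path_len, w.id)

def next_batch(ready: list[Waypoint], live: int, max_parallel: int) -> list[Waypoint]:
    """Pick which ready waypoints launch now without exceeding max_parallel live ships."""
    free_slots = max(0, max_parallel - live)
    return sorted(ready, key=dispatch_key)[:free_slots]

ready = [
    Waypoint("C", priority=1, critical_path_len=2),
    Waypoint("B", priority=1, critical_path_len=5),
    Waypoint("A", priority=2, critical_path_len=9),
]
batch = [w.id for w in next_batch(ready, live=1, max_parallel=3)]
print(batch)  # ['B', 'C']
```

Because the key is a pure function of fields stored in the plan, "why did B run before C" reduces to comparing two tuples.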
Act III · The long haul

Missions outlive machines.

A deep research run is a scout on a long-range survey: hours out, hours back. That is precisely when laptops close, processes die, and spot instances vanish.

This is why the fleet runs on Temporal, the durable execution engine (a $5 billion company; we compile to it rather than compete with it). Every event in the mission is written to the flight recorder. When the carrier is lost, a new carrier boots, replays the recording, and resumes the exact mission. The scout never notices. The research call was an activity: create, then poll, so the answer lands no matter which carrier is alive to receive it.
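
A minimal sketch of the create-then-poll shape, in plain Python rather than the Temporal SDK. `FakeResearchProvider`, `run_scout`, and the `recorder` dict are hypothetical stand-ins: the dict plays the role of durable history, and a real activity would wait between polls instead of looping immediately.

```python
import itertools

class FakeResearchProvider:
    """Hypothetical stand-in for a long-running research API."""
    def __init__(self, polls_until_done: int):
        self._ids = itertools.count(1)
        self.tasks = {}
        self._polls_until_done = polls_until_done

    def create(self, query: str) -> str:
        task_id = f"task-{next(self._ids)}"
        self.tasks[task_id] = {"query": query, "polls": 0}
        return task_id

    def poll(self, task_id: str):
        task = self.tasks[task_id]
        task["polls"] += 1
        if task["polls"] >= self._polls_until_done:
            return f"report for {task['query']!r}"
        return None  # still running

def run_scout(provider, recorder: dict, query: str, max_polls: int):
    """Create once, record the task id, then poll. Safe to call again after a crash."""
    if "task_id" not in recorder:           # first carrier: create and record
        recorder["task_id"] = provider.create(query)
    for _ in range(max_polls):              # any carrier: poll the recorded id
        result = provider.poll(recorder["task_id"])
        if result is not None:
            return result
    return None

provider = FakeResearchProvider(polls_until_done=5)
recorder = {}                                                    # stands in for durable history
first = run_scout(provider, recorder, "survey", max_polls=2)     # carrier "dies" here
second = run_scout(provider, recorder, "survey", max_polls=10)   # new carrier resumes
```

The second call never creates a second research task: it finds the recorded id and keeps polling, which is the property the paragraph above relies on.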

The same durability handles people. When a ship needs an answer from you, it reports blocked (gsd exit code 10), drops anchor, and parks. A three-day wait costs nothing, duplicates nothing, and resumes on your signal with your answer injected.

[Diagram: carrier lost mid-mission · flight recorder: full event history · deep research run (hours, not seconds) · a new carrier resumes the exact mission]

  • retries are idempotent: git-checked, ledger-checked, never double-applied
  • blocked work parks durably (exit 10) and resumes on your answer
  • 3 provider failures in a row: the breaker pauses dispatch, no silent model swap
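
The breaker rule can be sketched in a few lines. `ProviderBreaker` is a hypothetical name, and the sketch assumes the pause stays latched until someone resets it explicitly; the text above says only that dispatch pauses after three consecutive failures.

```python
class ProviderBreaker:
    """Pause dispatch after N consecutive provider failures; never swap models."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.paused = False

    def record(self, ok: bool) -> None:
        if ok:
            self.consecutive_failures = 0   # a success breaks the streak
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.paused = True              # assumption: latched until reset()

    def may_dispatch(self) -> bool:
        return not self.paused

    def reset(self) -> None:
        self.consecutive_failures = 0
        self.paused = False

breaker = ProviderBreaker()
for ok in [False, False, True, False, False]:
    breaker.record(ok)
still_open = breaker.may_dispatch()   # True: the streak never reached three
breaker.record(False)
now_paused = not breaker.may_dispatch()
```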
Act IV · Sealed orders

Governance is compiled in, not prompted in.

Every ship carries sealed orders: an intent contract that is machine-readable, versioned, and enforced by the workflow itself. Not a system prompt. A property of the mission.

INTENT CONTRACT · ic-blackskies-bench-01
goal · run the benchmark campaign end to end, zero hand-written workflow code
FALSIFICATION CRITERIA
fc-budget-cost · metric · spend over budget → halt
fc-path-violation · metric · diff escapes authority → halt
fc-mission-drift · judge · work stops serving the goal → replan
BUDGETS
cost_usd_max 40.00 · steps_max 120 · wall_clock_hours 36
AUTHORITY
write src/**, tests/** · forbid infra/**, .env*
SANDBOX
tier: microvm · network: none (research nodes: allowlist proxy)
DRIFT POLICY
cadence: boundary (phase ends) · proceed | warn | replan | halt

Illustrative rendering. The real schema is versioned JSON with strict validation; ambiguity here poisons everything downstream, so extra fields are forbidden and every criterion is falsifiable.
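
To make "extra fields are forbidden" concrete, here is a minimal sketch of strict parsing for the BUDGETS block alone. `Budgets` and `parse_strict` are hypothetical names, the real schema is versioned JSON that this does not reproduce, and the sketch checks field names only, not value types or ranges.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Budgets:
    cost_usd_max: float
    steps_max: int
    wall_clock_hours: float

def parse_strict(cls, raw: dict):
    """Build cls from raw, rejecting both unknown and missing fields."""
    allowed = {f.name for f in fields(cls)}
    extra = set(raw) - allowed
    missing = allowed - set(raw)
    if extra:
        raise ValueError(f"unknown fields: {sorted(extra)}")
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return cls(**raw)

budgets = parse_strict(Budgets, {
    "cost_usd_max": 40.00,
    "steps_max": 120,
    "wall_clock_hours": 36,
})

try:
    parse_strict(Budgets, {"cost_usd_max": 40.00, "steps_max": 120,
                           "wall_clock_hours": 36, "soft_limit": True})
    rejected = False
except ValueError:
    rejected = True   # the stray "soft_limit" field fails the whole contract
```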

Fuel is finite by construction

Cost, step, and wall-clock budgets are compiled into the workflow. At 80 percent the ship radios a warning (once, latched). At 100 percent the mission halts itself and writes the ledger entry. A halt is a designed outcome, not an error: a durable orchestrator that cannot forget must also be unable to fail to stop.

[Fuel gauge: 0 → warn · 80% → halt]

ledger ← {criterion: "budget:cost", outcome: "warn"}
ledger ← {criterion: "budget:cost", outcome: "halt"} · dispatch stopped, children cancelled gracefully
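
The warn-once, halt-at-limit behavior can be sketched as a small gauge. `FuelGauge` is a hypothetical name; the sketch assumes the halt fires when spend reaches the limit exactly (`>=`), and it covers only the cost budget, not steps or wall clock.

```python
class FuelGauge:
    def __init__(self, limit: float, warn_fraction: float = 0.8):
        self.limit = limit
        self.warn_at = limit * warn_fraction
        self.spent = 0.0
        self.warned = False
        self.halted = False
        self.ledger = []

    def debit(self, amount: float) -> bool:
        """Record spend; return True if the mission may continue."""
        if self.halted:
            return False
        self.spent += amount
        if self.spent >= self.limit:
            self.halted = True
            self.ledger.append({"criterion": "budget:cost", "outcome": "halt"})
            return False
        if self.spent >= self.warn_at and not self.warned:
            self.warned = True   # latched: the warning is written once
            self.ledger.append({"criterion": "budget:cost", "outcome": "warn"})
        return True

gauge = FuelGauge(limit=40.0)
results = [gauge.debit(x) for x in (30.0, 3.0, 2.0, 10.0, 1.0)]
outcomes = [row["outcome"] for row in gauge.ledger]
```

Here the second debit crosses 80 percent and writes the single warning, the fourth crosses the limit and halts, and the fifth is refused without touching the ledger.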

Recall an entire lineage with one command

Every workflow is tagged with its lineage at birth. One command cancels the whole tree, children first, gracefully; anything still running after the 120 second grace period is terminated and flagged. Each kill is a ledger entry.

$ occ kill lin-8f3a21c4 --reason "plan superseded"
cancelling 4 workflows, children first (grace 120s)…
done · escalated: 0 · ledger: 4 × outcome "killed"
depth 0 · root ← then the root, last
  depth 1 · cancelled first
  depth 1 · cancelled first
  depth 1 · cancelled first
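
"Children first" is a post-order walk of the lineage tree. A minimal sketch, with hypothetical workflow ids and a hypothetical `kill_order` helper; the grace period and escalation are left out.

```python
def kill_order(children: dict[str, list[str]], root: str) -> list[str]:
    """Post-order walk: every workflow appears after all of its descendants."""
    order = []

    def visit(node: str) -> None:
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order

flat = kill_order({"root": ["a", "b", "c"]}, "root")
print(flat)  # ['a', 'b', 'c', 'root']

deep = kill_order({"root": ["a", "b"], "a": ["a1"]}, "root")
print(deep)  # ['a1', 'a', 'b', 'root']
```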
Act V · The harbormaster

Ships never write the charts.

The single biggest lesson from adversarial review: a gate expressed in a prompt is a suggestion. So enforcement moved below the agent layer entirely. Ships propose; the harbormaster disposes.

Workers hold single-node scoped tokens and submit observations: diffs, artifacts, proposed results. Unprivileged validators canonicalize every path (symlinks rejected, traversal rejected, race conditions checked), run type-specific validators per artifact class, and query policy. Then one privileged committer, the only code in the system that writes to the work graph, the ledger, or canonical git, derives the ledger events by deterministic rule. There is no worker path that writes an arbitrary record. The kernel takes eight typed commands, contains no LLM, loads no plugins, and runs no shell.

[Diagram, top to bottom]
SHIPS SUBMIT OBSERVATIONS → diffs · artifacts · proposed results (scoped per-node tokens)
VALIDATORS (unprivileged): canonical paths · type checks · policy query
THE COMMITTER (privileged): the only writer · 8 typed commands · no LLM · no shell
CANONICAL STATE: beads · ledger · git · append-only · hash-chained
parked · evidence attached
symlink cargo, traversal cargo, a diff outside authority: all bounce off the same wall

claim · submit-artifact · validate-diff · append-derived-ledger-event · merge-candidate · promote · park · kill
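
A sketch of what an unprivileged path validator might look like, using the write and forbid patterns from the sealed orders shown earlier. `validate_write` is a hypothetical helper, matching the forbid patterns against the file name as well as the full path is an assumption, and this check alone does not close check-then-write races, which need an atomic open at write time.

```python
import os
from fnmatch import fnmatchcase
from pathlib import Path

ALLOW = ["src/**", "tests/**"]   # write authority
FORBID = ["infra/**", ".env*"]   # assumption: also matched against the file name

def validate_write(worktree: str, proposed: str) -> str:
    """Return the canonical worktree-relative path, or raise ValueError."""
    root = Path(os.path.realpath(worktree))
    if os.path.isabs(proposed):
        raise ValueError("absolute path")
    # Reject a symlink anywhere between the worktree root and the target.
    probe = root
    for part in Path(proposed).parts:
        probe = probe / part
        if probe.is_symlink():
            raise ValueError(f"symlink in path: {part}")
    resolved = Path(os.path.realpath(root / proposed))
    if resolved != root and root not in resolved.parents:
        raise ValueError("escapes worktree")
    rel = resolved.relative_to(root).as_posix()
    if any(fnmatchcase(rel, p) or fnmatchcase(resolved.name, p) for p in FORBID):
        raise ValueError("forbidden by authority")
    if not any(fnmatchcase(rel, p) for p in ALLOW):
        raise ValueError("outside write authority")
    return rel

def is_rejected(worktree: str, proposed: str) -> bool:
    try:
        validate_write(worktree, proposed)
        return False
    except ValueError:
        return True
```

Traversal such as `src/../infra/main.tf` is resolved before the policy check, so it is judged as `infra/main.tf` and bounces.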

Legal cargo can still be poison

Round two of adversarial review found the deeper hole: a worker does not need write authority if it can submit content a later node trusts. A diff that stays perfectly in-path can still poison a generated prompt, a card, a test, a schema. Path checks cannot see semantic poisoning.

So all cargo is tainted until ratified

Every piece of content carries one of four trust labels: executable, evidence, memory, or instruction. Evidence never auto-promotes to instruction. Nothing generated enters the instruction channel of a future node without human ratification. Each artifact class gets its own validator; a path check is necessary, never sufficient.
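
One way to picture the rule, as a minimal sketch: content keeps its label until a named human ratifies it, and only ratified instruction content is ever read into a future prompt. `Cargo`, `promote_to_instruction`, and `instruction_channel` are hypothetical names, not the system's API.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional

class Trust(Enum):
    EXECUTABLE = "executable"
    EVIDENCE = "evidence"
    MEMORY = "memory"
    INSTRUCTION = "instruction"

@dataclass(frozen=True)
class Cargo:
    body: str
    label: Trust
    ratified_by: Optional[str] = None   # a human identity, never an agent

def promote_to_instruction(cargo: Cargo, ratified_by: Optional[str]) -> Cargo:
    """The only path from any label to instruction; it requires a ratifier."""
    if not ratified_by:
        raise PermissionError("unratified content cannot become instruction")
    return replace(cargo, label=Trust.INSTRUCTION, ratified_by=ratified_by)

def instruction_channel(items: list[Cargo]) -> list[str]:
    """Only ratified, instruction-labelled content reaches a future node's prompt."""
    return [c.body for c in items
            if c.label is Trust.INSTRUCTION and c.ratified_by]

note = Cargo("prefer the v2 client", Trust.EVIDENCE)
ratified = promote_to_instruction(note, ratified_by="operator@example.com")
visible = instruction_channel([note, ratified])
```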

The merge queue guards the whole, not the part

Validated branches enter a single-writer merge queue. Each merge runs the integration gate against the merged tree, not the branch alone, so "green on my branch, red on main" parks instead of promoting. Later ships build only on promoted, validated state, never on an unmerged sibling's work.

Secrets never board the ships

Workers request named credentials from a broker that issues the narrowest available form, time-limited, and logs the issuance (never the value) to the ledger. Raw environment files never appear in worker-visible state. Network is deny-by-default for non-research nodes; research ships sail through an egress proxy with a domain allowlist. Model-provider calls count as egress too.

Act VI · The war council

Adversarial review, minus the theater.

Consequential decisions (architecture changes, replans, release gates, security exceptions) convene a council: blue proposes with evidence, red attacks, a judge rules. We built it the hard way, by first reading the evidence that naive multi-agent debate does not work.

Blue squadron proposes

  • Arguments cite an evidence pack by hash: cards, test results, research artifacts
  • Criteria and weights are frozen before any participant runs
  • Uncited claims score lower by instruction

Red squadron attacks

  • Distinct lenses per attacker, not clones of one skeptic
  • At the top tier, panels must span at least two model families (+4 to 6 points, measured)
  • Two rounds maximum, then escalate to a human. No debate spirals.

The judge is audited

  • Order randomized; every pairwise judged twice with sides swapped
  • Swap-inconsistent and close: the ruling escalates instead of pretending confidence
  • The judge's model family must differ from every panelist's
  • Rulings AND dissents are recorded in the ledger
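
The swap audit can be sketched as follows. This is an illustration under stated assumptions: `audited_ruling` and both toy judges are hypothetical, the judge is assumed to return a score in [-1, 1] that is positive when it prefers whichever argument it saw first, and the sketch escalates when the two orderings disagree or when the averaged margin is small. The list above does not say whether both conditions are required, so treat that combination rule as a guess. Order randomization is omitted.

```python
from typing import Callable

def audited_ruling(judge: Callable[[str, str], float],
                   blue: str, red: str, close_margin: float = 0.1) -> str:
    """Judge the pair twice with sides swapped; return 'blue', 'red', or 'escalate'."""
    forward = judge(blue, red)      # blue shown first
    swapped = -judge(red, blue)     # red shown first, flipped back to blue's point of view
    if (forward > 0) != (swapped > 0):
        return "escalate"           # the verdict depended on presentation order
    margin = (forward + swapped) / 2
    if abs(margin) < close_margin:
        return "escalate"           # too close to rule with confidence
    return "blue" if margin > 0 else "red"

def position_biased(first: str, second: str) -> float:
    return 0.3                      # always prefers whatever it reads first

def prefers_longer(first: str, second: str) -> float:
    return 0.5 if len(first) > len(second) else -0.5

biased = audited_ruling(position_biased, "blue case", "red case")
stable = audited_ruling(prefers_longer, "a long, cited argument", "no")
```

A judge with pure position bias flips its verdict when the sides swap, so the ruling escalates instead of being recorded.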

The shadow judge keeps us honest

  • Every top-tier council also runs one strong model, alone, same evidence, same criteria
  • If the lone judge agrees with the council 9 times out of 10, the council must prove it changes outcomes, or collapse
  • The same falsification discipline the product enforces, aimed at our own feature

  • red-team lesson: agents converge to agreement, not truth
  • what survives: pre-committed criteria, heterogeneous panels, execution-backed checks
  • dogfood scope: ships first as ONE independent checker; the full council is earned, not assumed
Act VII · Two voyages

Greenfield. Brownfield. Same fleet discipline.

New construction in open space, or a careful salvage of the structure your business already lives in. The fleet flies both.

Voyage one · Greenfield

Raise a new station from a blueprint

  1. Write the spec. The compiler drafts the flight plan and the sealed orders.
  2. The hygiene gate runs before launch: cycles, duplicates, and runaway discovery depth are quarantined on the ground.
  3. The carrier launches. Every module is built under contract from day one, budgets included.
  4. Dual-actor discipline, enforced not requested: the test-writing ship may only touch tests/; the implementing ship is forbidden from tests/. The harbormaster makes it physical.
  5. Gates green, station lit. The audit artifact already exists; nobody writes it after the fact.
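
The cycle check in step 2 can be sketched with Kahn's algorithm: anything that never becomes ready is quarantined. `quarantine_cycles` is a hypothetical helper; it quarantines nodes that are on a cycle or downstream of one, and it does not cover semantic duplicates or discovery depth.

```python
from collections import deque

def quarantine_cycles(deps: dict[str, set[str]]) -> tuple[list[str], set[str]]:
    """Return (launch_order, quarantined) for a graph of node -> prerequisites."""
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    remaining = {n: set(deps.get(n, ())) for n in nodes}
    dependents = {n: set() for n in nodes}
    for n, ds in remaining.items():
        for d in ds:
            dependents[d].add(n)
    queue = deque(sorted(n for n, ds in remaining.items() if not ds))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(dependents[n]):
            remaining[m].discard(n)
            if not remaining[m]:
                queue.append(m)
    return order, nodes - set(order)

order, quarantined = quarantine_cycles({
    "scaffold": {"survey"},
    "build": {"scaffold"},
    "x": {"y"}, "y": {"x"},   # a dependency cycle
    "z": {"x"},               # downstream of the cycle
})
```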
Voyage two · Brownfield

Salvage the derelict without sinking it

  1. Survey ships chart the megastructure: seams, coupling, the contracts it actually honors.
  2. Districts become bounded contexts. A human ratifies the map; generated context alone measurably hurts.
  3. Characterization tests first. Freeze what the structure really does before touching a plate.
  4. Strangler extraction, one district at a time, while the old structure keeps running and serving.
  5. Supply lines never cut: contract gates verify A/B parity on every extraction, and every behavioral delta is presumed a legacy bug until proven otherwise.
Act VIII · The fleet review

Best in class, measured honestly.

Assemble today's best tools with full effort and you get a genuinely strong stack. We say that plainly, because a value proposition that pretends otherwise would not survive its first demo. Here is what each does brilliantly, where it falls short in the big picture, and what we do about it.

Temporal

COMPILE TO IT

BEST IN CLASS: Durable execution. State survives any crash; a $5B company maintains the engine and its agent SDK integrations.

THE GAP: You still hand-fly every mission: each workflow is code you write, per campaign, and nothing checks the work still serves its intent.

OUR MOVE: We compile your intent into task-specific Temporal workflows. No compiler internals depend on Temporal-only concepts, so the substrate stays swappable by design.

Codex CLI · Claude Code · Cursor · Devin

DRIVE AS WORKERS

BEST IN CLASS: Superb single-session engineering. This layer is competitive and commoditized (Terminal-Bench 2.0, April 2026: Codex CLI at 82.2).

THE GAP: Brilliant pilots, no fleet: no durable multi-day campaigns, no shared mission, no recall, no audit.

OUR MOVE: They fly for the fleet: terminal-heavy work to Codex, judgment work to Claude Code, driven over ACP with a strict exit-code contract.

OpenHands Agent Canvas · GitHub Agent HQ

ADOPT FOR HOSTING

BEST IN CLASS: Always-on fleet hosting, schedules, webhook triggers, mission-control views across heterogeneous agents.

THE GAP: They host sessions; nobody owns the mission. Nothing in the stack answers "what was this work FOR, and is it still serving that?"

OUR MOVE: Adopt the hangar. Add sealed orders, drift adjudication with authority to halt, and a ledger that records the answer continuously.

Backstage · Port · Cortex

ADOPT THE ENVELOPE

BEST IN CLASS: Service catalogs agents can query; the catalog-as-agent-input pattern went mainstream in 2025-2026.

THE GAP: Read-only and insight-only. Nobody compiles catalog context into governed execution.

OUR MOVE: Context cards in a Backstage-compatible envelope, human-ratified, token-budgeted, injected per node at compile time.

Noma · Drata · Zenity

INTEGRATE + EXPORT

BEST IN CLASS: Agent identity, permissions, runtime security policy. Registry-to-runtime coverage with real enterprise traction.

THE GAP: Perimeter defense guards the ship, not the mission. Drifted-but-authorized work sails straight through.

OUR MOVE: Integrate, never compete: governance state, lineage, and audit events export to the controls you already bought.

LangSmith · Arize

ADOPT TRACING

BEST IN CLASS: Traces, evals, dashboards. Observability for agent behavior is a solved buy.

THE GAP: Traces show what happened. Audit requires what was ALLOWED: pre-committed criteria, admissible evidence, recorded dissent, why the gate opened.

OUR MOVE: Adopt the traces; build the drift ledger and per-campaign audit artifact they cannot produce. Evidence-grade by construction.

Memory products (Mem0, Graphiti…)

BUILD DIFFERENTLY

BEST IN CLASS: Conversation memory with clever retrieval.

THE GAP: None is grounded in execution truth. Summarized chatter is not what the fleet actually did.

OUR MOVE: Episodes distilled from durable execution history, hash-linked to the raw record, redacted and budgeted. Lessons with receipts; research confirmed nobody ships this.

AutoMAS (research)

THE WHITE SPACE

BEST IN CLASS: The closest prior art to intent-to-orchestrator compilation anywhere in the literature.

THE GAP: A research prototype. No durable-engine target, no governance contracts, no product.

OUR MOVE: This is the layer we build. The compiler is the product; everything else on this card list is a buy.

Assemble all of the above perfectly, and five holes remain

Nothing compiles intent

Every campaign is hand-wired glue: unversioned, unreviewed, different every time. The glue is where reliability dies.

No one owns the mission

Each tool holds a shard of state. None can answer whether the work still serves its purpose.

Stopping is a prompt, not a property

Budgets living in prompts and vigilance are not controls. The 2026 incident record agrees.

Traces are not audit

What happened is observability. What was allowed, who dissented, why the gate opened: that is audit.

The fleet never learns

No memory product is grounded in what actually ran. Execution history is the only honest teacher.

The honest boundary

If your agent work fits in interactive sessions, use Claude Code or Cursor and skip us.

If you need one scheduled job, Canvas or GitHub Actions is correct and sufficient.

If you need identity and runtime security policy, buy Noma or Drata; we integrate with them and do not replace them.

We buy twelve layers and build eight components, only where the research found nobody to buy from. When the market fills a hole, our own dependency register says swap, not defend.

Act IX · The red-team log

We hired an adversary twice. Both rounds drew blood.

Before writing the dogfood code, the design went through two chained adversarial research rounds (Parallel.ai, deepest tier). Findings below are from the actual reports, with what changed because of each.

CRITICAL · round 1 · entry 01

Ships were writing their own star charts

Workers with direct write access to the work graph, ledger, and git could route around every gate, no matter what the prompts said.

WHAT CHANGED The harbormaster exists because of this finding. Enforcement moved below the agent layer: scoped per-node tokens, workers submit proposals, one privileged committer owns every write.

CRITICAL · round 1 · entry 02

A legal-looking route can smuggle cargo

Path allowlists fall to symlinks, traversal, and check-then-write races (CWE-22, CWE-59, CWE-367; there is a 2026 CVE for an agent writing outside its workspace).

WHAT CHANGED Paths are canonicalized outside the agent, symlinks rejected on write targets, one worktree per ship, and a final git-diff validation as the backstop wall.

HIGH · round 1 · entry 03

A fuel gauge nobody debits is decoration

Judge calls, council runs, retries: if side costs do not debit the same budget, the halt is fiction.

WHAT CHANGED Metering became reserve, debit, reconcile. Providers are callable only through wrappers, and a metering outage trips the breaker rather than failing silent.
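
One reading of reserve, debit, reconcile, as a minimal sketch: reserve an estimate before any provider call, debit the actual cost when the call returns, and periodically compare the running total with what the provider reports. `Meter` and its method names are hypothetical, the three-step interpretation is an assumption, and amounts are integer cents to keep the arithmetic exact.

```python
class Meter:
    def __init__(self, budget_cents: int):
        self.budget = budget_cents
        self.spent = 0
        self.reserved = {}   # call_id -> reserved estimate

    def available(self) -> int:
        return self.budget - self.spent - sum(self.reserved.values())

    def reserve(self, call_id: str, estimate_cents: int) -> None:
        """Refuse the call up front if its estimate does not fit the remaining budget."""
        if estimate_cents > self.available():
            raise RuntimeError(f"{call_id}: reservation exceeds remaining budget")
        self.reserved[call_id] = estimate_cents

    def debit(self, call_id: str, actual_cents: int) -> None:
        """Release the reservation and record what the call really cost."""
        del self.reserved[call_id]
        self.spent += actual_cents

    def reconcile(self, provider_total_cents: int) -> int:
        """Return the drift between the provider's total and our own."""
        return provider_total_cents - self.spent

meter = Meter(budget_cents=4000)
meter.reserve("judge-1", 500)
meter.reserve("retry-1", 3000)
try:
    meter.reserve("council-1", 600)   # 500 + 3000 + 600 > 4000
    refused = False
except RuntimeError:
    refused = True
meter.debit("judge-1", 420)
drift = meter.reconcile(provider_total_cents=450)
```

Judge calls, council runs, and retries all pass through the same `reserve` gate, so a side cost cannot spend fuel the gauge never saw.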

CRITICAL · round 2 · entry 01

The cargo was legal. The manifest was a lie.

Round two found what round one missed: a worker needs no write authority if it can submit content a later node trusts. In-path diffs can poison prompts, cards, tests, schemas.

WHAT CHANGED Workers submit observations, never ledger events; the kernel derives records by deterministic rule. Four trust labels on all content, tainted until ratified, with type-specific validators per artifact class.

HIGH · round 2 · entry 02

We built one harbor so strong that every attack now aims at the harbor

The safety mass concentrated into five new chokepoints: the boundary, policy compilation, integration promotion, metering, and the founder's inbox.

WHAT CHANGED The boundary shrank to a minimal security kernel: eight typed commands, no LLM, no plugins, no shell, validators split from the one privileged committer, and eight formal invariants modeled before feature work.

HIGH · round 2 · entry 03

Every ship idling at the same dock

Running the full gate suite on every merge serializes the campaign at the exact point parallelism was bought. Queueing theory does not negotiate.

WHAT CHANGED Merge groups with fast affected-scope gates per merge; the slow full suite runs on a schedule. Parallelism is now a function of integration throughput, not wishful thinking.

HIGH · round 2 · entry 04

Governance became a ticket factory aimed at one human

A single incident could spawn six parallel process artifacts. The operator is part of the system; overload them and every stop light goes stale.

WHAT CHANGED One inbox item with facets instead of six workflows, and WIP limits that physically stop dispatch: one active campaign, five open inbox items, zero unresolved severe halts.

"Net safer against untrusted workers, but not yet safer as a system."
ADVERSARIAL ROUND 2, CLOSING VERDICT · 2026-07-02

The response: cut hard, prove it small

That verdict set the build order. The full platform waits; a minimal dogfood core ships first, and anything cut returns only if a real campaign concretely fails without it. The kernel stays tiny, pinned, testable, and observable, and a doctor command verifies the enforcement claims on the actual host before any campaign is allowed to launch. If the host cannot prove its sandbox tier, compilation fails. Honesty is checked, not assumed.

Eight invariants, formally modeled

Specified in TLA+/Alloy and enforced with property and chaos tests against the kernel before feature work:

  • SingleWriter
  • NoPromoteOnRed
  • NoDispatchAfterKill
  • LedgerAppendOnly
  • BudgetConservation
  • PolicyReplay
  • NoUnratifiedInstructionInjection
  • AuthorityMonotonicityForDeny
Act X · Boarding

Built in the open, with a public kill criterion.

Ambient Agents is under active development. Here is exactly what is being built now, what is deliberately deferred, and the date on which we will prove it or fold it.

Building now: the dogfood core

  • The compiler: spec or work graph in, durable governed workflow out
  • The harbormaster kernel, with the content-poisoning fix baked in
  • Append-only, hash-chained ledger with heads signed outside the database
  • One policy engine; authority compiled from a typed AST, golden-tested
  • Metering as reserve, debit, reconcile; budget halts that actually fire
  • Honest sandbox tiers: the host proves enforcement or compilation fails
  • A minimal web cockpit: campaigns, budget burn, one inbox, ledger tail, kill, answer
  • Two workers: gsd-pi headless plus one ACP coding agent
  • The real dogfood campaign that all of this must survive
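
For readers unfamiliar with hash chaining, a minimal sketch of an append-only, hash-chained ledger follows. `append`, `verify`, and the row layout are hypothetical, SHA-256 over canonical JSON is an assumption, and signing the head outside the database is not shown.

```python
import hashlib
import json

GENESIS = "0" * 64

def entry_hash(prev_hash: str, event: dict) -> str:
    # Canonical JSON so the same event always hashes the same way.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

def append(ledger: list, event: dict) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"event": event, "prev": prev, "hash": entry_hash(prev, event)})

def verify(ledger: list) -> bool:
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in ledger:
        if row["prev"] != prev or row["hash"] != entry_hash(prev, row["event"]):
            return False
        prev = row["hash"]
    return True

ledger = []
append(ledger, {"criterion": "budget:cost", "outcome": "warn"})
append(ledger, {"criterion": "budget:cost", "outcome": "halt"})
intact = verify(ledger)

ledger[0]["event"]["outcome"] = "ok"   # tamper with history
tampered_detected = not verify(ledger)
```

The last row's hash is the head; it commits to every earlier row, which is why signing only the head is enough to pin the whole history.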

Deliberately deferred (cut until a real campaign fails without it)

  • The full red/blue council: one independent checker ships first
  • VS Code extension: web cockpit only
  • Backstage deployment: compatible card format, not the platform
  • Auto-injected memory: episodes stay searchable evidence, never instruction
  • Heavy retrieval stacks: deterministic card, grep, hybrid order first
  • Provider auto-swap: the breaker pauses instead; silent substitutions are what we exist to prevent
  • Enterprise document ingest: repo-only until data residency is designed

THE KILL CRITERION IS PUBLIC

Live dogfood by 2027-01-02, or the product folds into existing tools and the retrospective gets published either way. The same pre-committed falsification discipline the fleet enforces on agents, applied to the fleet itself.