Workflows Payload-Pointer Pattern

ADR-046 — Workflows Payload-Pointer Pattern

Status: Accepted
Date: 2026-05-07
Decider: Mishaal Murawala (delegated engineering judgment to Claude Code as engineering lead)
Supersedes: none
Related: ASCEND_OPERATOR_OS_ENGINEERING_STANDARD.md §6, ADR-026

Context

Workers Workflows are the canonical multi-step orchestration primitive for the Operator OS (per ADR-026 and Vision invariant 9 — “CF Cron + CF Workflows for scheduled and multi-step work”). Workflows persist their state durably so steps can replay across restarts, deploys, and Worker isolate evictions.

Workflow state is bounded. Cloudflare’s documented payload-size limits today (1 MB per step input, 100 KB recommended for routine reliability), combined with the operational reality of replaying state on restart, mean that passing large payloads through Workflow step inputs is a foot-gun:

Replay overhead grows linearly with payload size.
Network/serialization cost compounds across step boundaries.
Debugging a 7-day-old failed Workflow run requires extracting the in-flight payload from state, which is opaque.
A single oversized step input can fail the entire Workflow with a non-obvious error.

The Operator OS routinely handles artifacts that exceed these limits: long Gong call transcripts, multi-thousand-row CRM exports, LLM responses with embedded JSON arrays, anonymized pattern bank outputs, full email thread bodies. Without an enforced convention, these will end up inside Workflow step inputs and fail runs.

Multi-model review (Gemini 3.1 Pro specifically) flagged this as one of the two highest-likelihood Q1 incident classes.

Decision

Any artifact larger than 64 KB or any binary blob is written to R2 and passed through Workflow steps as a pointer (R2 key + content hash + size), never as bytes.

The pattern

Producer step writes to R2.

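// Serialize once so the stored body, the hash, and the size all describe the same bytes.
const body = JSON.stringify(artifact);
const r2Key = `workflow-payloads/${tenantId}/${workflowName}/${runId}/${stepName}.json`;
await env.WORKFLOW_PAYLOADS.put(r2Key, body, {
  httpMetadata: { contentType: 'application/json' },
});
// Hash the serialized text, because the consumer verifies against the text it reads back.
const sha256 = await contentHashHex(body);
// Size in UTF-8 bytes; string length would undercount non-ASCII content.
const sizeBytes = new TextEncoder().encode(body).byteLength;
return { r2Key, sha256, sizeBytes };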

Step return value is the pointer, not the bytes.

Consumer step fetches by key.

const obj = await env.WORKFLOW_PAYLOADS.get(pointer.r2Key);
if (!obj) throw new Error(`payload missing: ${pointer.r2Key}`);
// Read the body exactly once: an R2 object body is a stream and cannot be consumed twice.
const text = await obj.text();
const actual = await contentHashHex(text);
if (actual !== pointer.sha256) throw new Error(`payload hash mismatch: ${pointer.r2Key}`);
const artifact = JSON.parse(text);
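
The snippets above call contentHashHex without defining it. A minimal sketch of what it could look like, assuming it takes the serialized payload as a string and returns a lowercase hex SHA-256 digest computed with the Web Crypto API (global in Workers and in recent Node versions). The signature is an assumption; the real helper may differ.

```typescript
// Hypothetical helper: lowercase hex SHA-256 of a string's UTF-8 bytes.
// Assumes a global Web Crypto `crypto.subtle`, as provided by the Workers runtime.
async function contentHashHex(text: string): Promise<string> {
  const bytes = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  // Render each digest byte as two hex characters.
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}
```

Hashing the serialized text rather than the in-memory object keeps producer and consumer comparing the same bytes, independent of key ordering or re-serialization.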

Workflow run metadata records the pointer in D1 workflow_runs (cold-path table) so a debugger can fetch the same artifact 7 days later.
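
Recording the pointer could look roughly like the sketch below. The column names and the one-row-per-step layout are assumptions made for illustration, since the workflow_runs schema is not defined yet; only the prepare, bind, and run calls mirror the D1 binding API. D1Like is a hypothetical structural type so the sketch does not depend on Workers type packages.

```typescript
// Structural subset of the D1 binding used here (prepare, then bind, then run).
interface D1Like {
  prepare(sql: string): { bind(...values: unknown[]): { run(): Promise<unknown> } };
}

// Pointer shape from the Decision section: R2 key + content hash + size.
interface PayloadPointer {
  r2Key: string;
  sha256: string;
  sizeBytes: number;
}

// Hypothetical columns; adjust to the real workflow_runs migration once it exists.
async function recordPointer(
  db: D1Like,
  runId: string,
  stepName: string,
  pointer: PayloadPointer,
): Promise<void> {
  await db
    .prepare(
      "INSERT INTO workflow_runs (run_id, step_name, r2_key, sha256, size_bytes) VALUES (?, ?, ?, ?, ?)",
    )
    .bind(runId, stepName, pointer.r2Key, pointer.sha256, pointer.sizeBytes)
    .run();
}
```

Storing the hash and size next to the key lets a debugger confirm that the object fetched later is the one the step actually produced.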

Naming convention

r2://workflow-payloads/{tenant_id}/{workflow_name}/{run_id}/{step_name}.{ext}

tenant_id — for tenant isolation (per ADR-045).
workflow_name — for grouping (e.g., gong-ingest, sdr-agent-run, pattern-bank-anonymize).
run_id — UUID of the Workflow instance.
step_name — matches the step.do(name, ...) name for traceability.
.ext — .json for structured, .txt for plain, .bin for binary.
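
One way to keep keys consistent with this convention is to build them in a single function. The sketch below is illustrative: buildPayloadKey, its parameter names, and the allowed-character rule are assumptions, and the returned key keeps the workflow-payloads/ prefix used in the producer snippet.

```typescript
type PayloadExt = "json" | "txt" | "bin";

interface PayloadKeyParts {
  tenantId: string;
  workflowName: string;
  runId: string;
  stepName: string;
  ext: PayloadExt;
}

// Assumed rule: segments are non-empty and limited to characters that cannot
// introduce extra path levels. Step names with spaces would need normalizing first.
const SAFE_SEGMENT = /^[A-Za-z0-9._-]+$/;

function buildPayloadKey(parts: PayloadKeyParts): string {
  const { tenantId, workflowName, runId, stepName, ext } = parts;
  for (const [name, value] of Object.entries({ tenantId, workflowName, runId, stepName })) {
    if (!SAFE_SEGMENT.test(value) || value === "." || value === "..") {
      throw new Error(`invalid ${name} for payload key: ${JSON.stringify(value)}`);
    }
  }
  return `workflow-payloads/${tenantId}/${workflowName}/${runId}/${stepName}.${ext}`;
}
```

Rejecting a slash inside a segment matters most for tenantId, since the tenant prefix is what carries isolation.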

Threshold

The threshold for “is this a pointer or a value” is 64 KB. At or below 64 KB, pass the value inline (faster, fewer R2 ops). Above 64 KB, always pass a pointer.
Any binary blob (e.g., audio file from Gong, image from a generated asset) is always a pointer regardless of size.
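
The rule above can be sketched as a small predicate. This is a sketch under stated assumptions: needsPointer is a hypothetical name, 64 KB is read as 65,536 bytes, strings are measured as raw UTF-8, and other values are measured by their JSON serialization.

```typescript
// Assumption: "64 KB" means 64 * 1024 bytes.
const POINTER_THRESHOLD_BYTES = 64 * 1024;

function isBinary(value: unknown): value is ArrayBuffer | ArrayBufferView {
  return value instanceof ArrayBuffer || ArrayBuffer.isView(value);
}

// True when the value must travel as an R2 pointer instead of inline.
function needsPointer(value: unknown): boolean {
  // Binary blobs are always pointers, regardless of size.
  if (isBinary(value)) return true;
  const serialized = typeof value === "string" ? value : JSON.stringify(value);
  // Compare UTF-8 byte length, not string length, against the threshold.
  return new TextEncoder().encode(serialized).byteLength > POINTER_THRESHOLD_BYTES;
}
```

A value of exactly 65,536 bytes stays inline under this reading, which matches "larger than 64 KB" in the Decision.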

Lifecycle

R2 versioning is enabled on the workflow-payloads bucket.
Lifecycle policy: delete payloads older than 30 days. Workflows that need long-term retention copy the artifact to a different R2 prefix (e.g., r2://artifacts/{tenant_id}/{type}/...).
Quarterly: confirm that a sample of pointers from older Workflow runs (still within the 30-day retention window) remain resolvable for debugging.
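
The quarterly check could be scripted along these lines. This is a hedged sketch: findUnresolvablePointers and HeadableBucket are hypothetical names, the sampled keys are assumed to come from runs whose payloads have not yet been expired by the lifecycle policy, and the only bucket method relied on is head(), which resolves to null when an object does not exist.

```typescript
// Structural subset of an R2 bucket binding: head() resolves to null for a missing key.
interface HeadableBucket {
  head(key: string): Promise<object | null>;
}

// Returns the sampled keys that no longer resolve to an object.
async function findUnresolvablePointers(
  bucket: HeadableBucket,
  sampledKeys: string[],
): Promise<string[]> {
  const missing: string[] = [];
  for (const key of sampledKeys) {
    // head() fetches metadata only, so payload bodies are never downloaded.
    if ((await bucket.head(key)) === null) missing.push(key);
  }
  return missing;
}
```

An empty result means every sampled pointer is still resolvable; any returned key is a candidate for investigation.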

Alternatives considered

Inline everything; bump step input limits via tickets. Cloudflare limits are not negotiable per-account in a useful way, and this approach kicks the can on replay overhead and debuggability.
Use D1 instead of R2 for payloads. D1 columns can hold blobs but are not ideal for opaque-large data; R2 is purpose-built for this and is cheaper per byte. R2 also gives versioning + lifecycle out of the box.
Pass through KV. KV's value size limit is 25 MB, which is generous, but KV is the hot-path store and we explicitly do not want Workflow artifacts polluting it. KV's cost model (each write billed as its own operation) is also worse than R2's for write-once-read-rarely artifacts.
Compress inline above 64 KB. Compressing JSON typically yields a 5-10x size reduction, which would push some payloads back under the limit. But the outcome is unpredictable (the ratio varies with content), and compressing and decompressing consumes CPU in a Worker that has tight CPU limits. The pointer pattern is more predictable.

Consequences

Wins

Workflow state is small and replay is cheap.
Debugging is straightforward: pointer → R2 GET → inspect.
Tenant isolation extends to Workflow artifacts via R2 prefix structure.
Lifecycle policy gives automatic GC.
A Workflow that fails mid-run can be re-driven from the last successful step’s pointer rather than replaying from input.

Costs

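Two R2 ops (PUT + GET) per artifact handoff. R2 ops are cheap but non-zero: at Cloudflare's published rates at the time of writing, roughly $4.50 per million Class A operations (PUT) and $0.36 per million Class B operations (GET), beyond the monthly free allowance.
One extra failure mode: the payload is missing from R2 or no longer matches its pointer (covered by the null check, the hash check, and explicit errors).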
Boilerplate per step (helper module src/lib/workflow-payload.ts will absorb most of this — to be created when first Operator OS Q1 Workflow lands).

Open items (tracked)

src/lib/workflow-payload.ts — helper module with writePayload(), readPayload(), verifyHash() to be created on first Q1 Workflow merge. Tech-debt row OO-046-001.
wrangler.toml — declare WORKFLOW_PAYLOADS R2 binding before first Q1 Workflow merge.
D1 workflow_runs table — to be added in the migration that ships the first Q1 Workflow.
Lifecycle policy on R2 bucket — set up at the same time.

Reversal criteria

Cloudflare ships per-step limits that exceed the 99th-percentile artifact size in our system → review threshold, possibly raise from 64 KB.
A single Workflow type proves to be entirely sub-64KB and the boilerplate is unjustified there → that one Workflow can opt out via comment justification, but the helper still exists for the rest.
R2 cost overruns relative to KV → reconsider, but unlikely given write-once-read-rarely access pattern.

The pattern itself is not reversed; specific thresholds may be tuned.