
feat(run-ops): webapp write path — trigger/batch minting, idempotency routing, run lifecycle #4118

Draft
d-cs wants to merge 29 commits into main from runops/pr06-write-path
@d-cs d-cs commented Jul 2, 2026

Collaborator

What

Routes the webapp write path through the run-ops split seam: trigger/batch minting, idempotency-key resolution, and the run-lifecycle services now determine residency and dispatch writes to the correct store.

  • Trigger & batch (runEngine/services/triggerTask.server.ts, batchTrigger.server.ts, createBatch.server.ts, streamBatchItems.server.ts, v3/services/batchTriggerV3.server.ts): mint ids with the run-ops-aware minting and route creation/streaming through the store; batch children inherit the parent's residency.
  • Idempotency (runEngine/concerns/idempotencyKeys.server.ts + new idempotencyResidency.server.ts): idempotency-key lookup/dedup is residency-aware so a keyed retrigger resolves against the store that owns the original run.
  • Run lifecycle services (createCheckpoint, createTaskRunAttempt, enqueueDelayedRun, expireEnqueuedRun, finalizeTaskRun, resumeBatchRun, cancelDevSessionRuns, executeTasksWaitingForDeploy, triggerFailedTask): resolve their target run through the store rather than a fixed client.
  • Reads that fan out from writes (runsRepository + clickhouseRunsRepository, BulkActionV2 + batch read-through, realtime sessions/runReader, alerts deliverAlert/performTaskRunAlerts): route through the read-through resolver.
  • 9535ae63d — resolves the parent run through an injectable run store in TriggerFailedTaskService.
  • bf8f7c881 — drops the "known-migrated" concept from write-path and read repos; residency is id-shape only.
  • 515b897ea — self-defaults resolveWaitpointThroughReadThrough to the safe run-ops clients.

Why

PR6 of the run-ops split stack. This is the write-path counterpart to the read foundation in the previous PRs: with it in place, both reads and writes route through the seam. Additive when the split is disabled (id-shape resolution collapses to the control-plane client); behavior-changing on the minting, idempotency, and lifecycle paths when enabled.
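
A minimal sketch of what id-shape resolution could look like, assuming NEW-resident ids are 27-character ksuids (as a test comment later in this thread suggests) and legacy ids are 25-character cuids starting with `c`. The names `classifyResidency`, `pickClient`, and `Clients` are hypothetical and exist only for illustration; they are not the PR's actual API, and throwing on an unclassifiable id is likewise an assumption.

```typescript
// Hypothetical sketch: residency is derived purely from the shape of the id.
type Residency = "NEW" | "LEGACY" | "UNCLASSIFIABLE";

// Assumption: friendly ids look like "<prefix>_<id>", e.g. "run_<id>".
function stripPrefix(friendlyId: string): string {
  const idx = friendlyId.indexOf("_");
  return idx === -1 ? friendlyId : friendlyId.slice(idx + 1);
}

function classifyResidency(friendlyId: string): Residency {
  const id = stripPrefix(friendlyId);
  // Assumption: ksuids are exactly 27 base62 characters.
  if (/^[0-9A-Za-z]{27}$/.test(id)) return "NEW";
  // Assumption: legacy cuids are 25 lowercase alphanumerics starting with "c".
  if (/^c[0-9a-z]{24}$/.test(id)) return "LEGACY";
  return "UNCLASSIFIABLE";
}

interface Clients<T> {
  controlPlane: T;
  runOpsNew: T;
  runOpsLegacy: T;
}

function pickClient<T>(friendlyId: string, splitEnabled: boolean, clients: Clients<T>): T {
  // With the split disabled, resolution collapses to the control-plane client.
  if (!splitEnabled) return clients.controlPlane;
  switch (classifyResidency(friendlyId)) {
    case "NEW":
      return clients.runOpsNew;
    case "LEGACY":
      return clients.runOpsLegacy;
    default:
      // Assumption: this sketch fails loudly rather than guessing a store.
      throw new Error(`Unclassifiable id: ${friendlyId}`);
  }
}
```

Because the decision depends only on the id string and one flag, the same function can serve minting, idempotency dedup, and lifecycle lookups without a database probe.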

Tests

Large new/expanded vitest suite under apps/webapp/test/ and colocated service tests: trigger-task and batch-trigger store routing, residency inheritance, idempotency dedup residency + legacy-authority, bulk-action read routing, cancel-dev-session routing, alerts store routing, runs-repository read-through, realtime session/run-reader read-through and stream-registration routing, and the waitpoint read-through default. Testcontainers-backed; no mocks.

Notes

Draft, stacked on #4117 (runops/pr05-webapp-foundation). Review that first; this diff is against it.

Server-change / changeset note to be added at stack-assembly time.

🤖 Generated with Claude Code

changeset-bot Bot commented Jul 2, 2026


⚠️ No Changeset found

Latest commit: 4bda37a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai Bot commented Jul 2, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ffc3cf87-a202-40d1-9339-d287c1dd2cbd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@devin-ai-integration devin-ai-integration Bot left a comment

Contributor


Devin Review found 1 potential issue.


Comment thread apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
d-cs force-pushed the runops/pr05-webapp-foundation branch from 413a945 to 99643f8 on July 2, 2026 18:02
d-cs force-pushed the runops/pr06-write-path branch from 515b897 to cb97148 on July 2, 2026 18:02
pkg-pr-new Bot commented Jul 2, 2026


@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@4bda37a

trigger.dev

npm i https://pkg.pr.new/trigger.dev@4bda37a

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@4bda37a

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@4bda37a

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@4bda37a

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@4bda37a

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@4bda37a

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@4bda37a

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@4bda37a

commit: 4bda37a

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts (1)

245-294: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

blockRunWithWaitpoint still writes via this.prisma, not the resolved dedupClient.

dedupClient (computed above at lines 155-174) is derived using parentRunFriendlyId: request.body.options?.parentRunId, the exact same parentRunId used here to block the parent run's waitpoint (lines 246, 279). dedupClient is precisely the client that owns this parent run's residency, yet the call at line 290 still passes tx: this.prisma, the (possibly wrong) fallback client.

If split mode is enabled and the parent run resides on the "new" store while this.prisma targets the legacy store (or vice versa), this write would target the wrong database, failing to find the parent run row or silently writing state to a store that doesn't own it — contradicting the PR's core objective of routing writes to the store that owns the target run.

🐛 Proposed fix
             await this.engine.blockRunWithWaitpoint({
               runId: RunId.fromFriendlyId(parentRunId),
               waitpoints: associatedWaitpoint!.id,
               spanIdToComplete: spanId,
               batch: request.options?.batchId
                 ? {
                     id: request.options.batchId,
                     index: request.options.batchIndex ?? 0,
                   }
                 : undefined,
               projectId: request.environment.projectId,
               organizationId: request.environment.organizationId,
-              tx: this.prisma,
+              tx: dedupClient,
             });
🧹 Nitpick comments (8)
apps/webapp/app/runEngine/concerns/resolveWaitpointThroughReadThrough.server.ts (1)

44-49: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Consider forwarding logger/onLegacyReplicaRead for parity with other read-through consumers.

ReadThroughDeps supports logger and onLegacyReplicaRead (saturation-signal hook), but ResolveWaitpointDeps/this wrapper drop both, so legacy-replica reads for waitpoints won't emit the saturation signal that other read-through call sites presumably rely on for monitoring split-read health.

apps/webapp/app/runEngine/services/batchTrigger.server.ts (2)

92-99: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Discards the minted id and re-derives it via BatchId.fromFriendlyId.

mintBatchFriendlyId returns { id, friendlyId }, but only friendlyId is kept here; id is recomputed later via BatchId.fromFriendlyId(batchId) (lines 175, 266). createBatch.server.ts uses the returned id directly instead of re-deriving it. Functionally likely equivalent if BatchId.fromFriendlyId is a lossless decode, but it's redundant work and an inconsistency between two services from the same PR doing the same job.

♻️ Proposed consistency fix
-          const { friendlyId } = await mintBatchFriendlyId({
+          const { id, friendlyId } = await mintBatchFriendlyId({
             environment: {
               organizationId: environment.organizationId,
               id: environment.id,
               orgFeatureFlags: environment.organization.featureFlags,
             },
             parentRunFriendlyId: body.parentRunId,
           });

Then thread id through to #createAndProcessBatchTaskRun and use it directly instead of BatchId.fromFriendlyId(batchId).

Please confirm BatchId.fromFriendlyId reliably reconstructs the same id for both ksuid- and cuid-shaped friendly ids before treating this purely as a style nit.

Also applies to: 169-184, 265-275


359-374: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Missing batch is silently ignored — no log emitted.

When findBatchTaskRunById returns nothing, the function returns silently, unlike the environment miss two lines below which logs an error. Given store-routing bugs could make a batch invisible from the wrong store, this failure mode deserves the same observability.

🔍 Proposed fix
     const batch = await this._engine.runStore.findBatchTaskRunById(options.batchId);

     if (!batch) {
+      logger.error("[RunEngineBatchTrigger][processBatchTaskRun] Batch not found", {
+        options,
+      });
       return;
     }
apps/webapp/test/engine/streamBatchItems.test.ts (1)

655-662: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Repeated PostgresRunStore wiring across 7 test cases.

The same 4-line block assigning engine.runStore = new PostgresRunStore({ prisma: racingPrisma, readOnlyPrisma: racingPrisma }) appears 7 times. A small helper (e.g. attachRacingRunStore(engine, racingPrisma)) would reduce duplication and centralize any future changes to how the racing store is wired.

Also applies to: 787-794, 919-926, 1052-1059, 1272-1279, 1411-1418, 1600-1604
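
A minimal sketch of the helper suggested above. The `PostgresRunStore` class and the `EngineLike` and `PrismaLike` types below are simplified, hypothetical stand-ins so the example is self-contained; only the constructor shape `{ prisma, readOnlyPrisma }` is taken from the review text, and the real types in the repository may differ.

```typescript
// Hypothetical stand-ins; the real PostgresRunStore and engine live in the repo.
interface PrismaLike {
  readonly name: string;
}

class PostgresRunStore {
  readonly prisma: PrismaLike;
  readonly readOnlyPrisma: PrismaLike;

  constructor(opts: { prisma: PrismaLike; readOnlyPrisma: PrismaLike }) {
    this.prisma = opts.prisma;
    this.readOnlyPrisma = opts.readOnlyPrisma;
  }
}

interface EngineLike {
  runStore: PostgresRunStore;
}

// Points both the write and the read handle of the engine's store at the
// racing client, so each test wires the store in one call.
function attachRacingRunStore(engine: EngineLike, racingPrisma: PrismaLike): PostgresRunStore {
  const store = new PostgresRunStore({ prisma: racingPrisma, readOnlyPrisma: racingPrisma });
  engine.runStore = store;
  return store;
}
```

Each of the seven test cases would then call `attachRacingRunStore(engine, racingPrisma)` in place of the repeated assignment.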

apps/webapp/app/v3/services/createCheckpoint.server.ts (1)

149-154: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Batch lookup correctly routed through RunStore.

Both WAIT_FOR_BATCH lookups now call runStore.findBatchTaskRunByFriendlyId(friendlyId, environmentId), matching the upstream contract that probes NEW then LEGACY sub-stores scoped by (friendlyId, environmentId). This correctly fixes the pre-existing gap where a raw single-DB Prisma query would miss a NEW-resident (ksuid) batch.

The identical 5-line lookup + comment block is duplicated at both call sites (149-154 and 364-369). Extracting a small private helper would remove the duplication and centralize any future changes to the routing logic.

♻️ Proposed extraction
+  // Routed by friendlyId so a ksuid (NEW-resident) batch is found on the owning DB;
+  // env-scoped to the dependent attempt's run (a batch shares its dependent's env).
+  private async findWaitForBatchRun(batchFriendlyId: string, environmentId: string) {
+    return this.runStore.findBatchTaskRunByFriendlyId(batchFriendlyId, environmentId);
+  }

Then replace both call sites with:

-        // Routed by friendlyId so a ksuid (NEW-resident) batch is found on the owning DB;
-        // env-scoped to the dependent attempt's run (a batch shares its dependent's env).
-        const batchRun = await this.runStore.findBatchTaskRunByFriendlyId(
-          reason.batchFriendlyId,
-          attempt.taskRun.runtimeEnvironmentId
-        );
+        const batchRun = await this.findWaitForBatchRun(
+          reason.batchFriendlyId,
+          attempt.taskRun.runtimeEnvironmentId
+        );

Also applies to: 364-369

apps/webapp/app/v3/services/executeTasksWaitingForDeploy.ts (1)

74-111: 🩺 Stability & Availability | 🔵 Trivial

Solid defense-in-depth split; consider a quarantine path for stuck NEW-resident runs.

The NEW/legacy split correctly prevents a control-plane updateMany/enqueue from touching runs it can't actually own. One residual risk: if a NEW-resident run keeps getting selected by findRuns (e.g. from a real misconfiguration), it will never transition out of WAITING_FOR_DEPLOY, so this job will re-log the same error and potentially keep rescheduling itself (via the runsWaitingForDeploy.length > maxCount reschedule) on every poll, indefinitely.

Consider adding a metric/alert-worthy signal or a way to skip re-selecting known-stuck NEW-resident runs.

apps/webapp/test/engine/triggerTask.test.ts (1)

2393-2402: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Redundant dynamic import(): generateKsuidId is already statically imported.

generateKsuidId is imported statically at the top of the file (line 33) and used directly a few lines later (line 2432: generateKsuidId()). The dynamic await import("@trigger.dev/core/v3/isomorphic") here is unnecessary and inconsistent with the static-import usage elsewhere in the same file.

♻️ Proposed fix
       const parentFriendlyId = RunId.toFriendlyId(
-        // 27-char ksuid → classifies NEW
-        (await import("@trigger.dev/core/v3/isomorphic")).generateKsuidId()
+        // 27-char ksuid → classifies NEW
+        generateKsuidId()
       );

As per coding guidelines: "Prefer static imports over dynamic import(), and only use dynamic imports for unresolved circular dependencies, genuine code-splitting needs, or conditional runtime loading."

Source: Coding guidelines

apps/webapp/test/idempotencyDedupResidency.test.ts (1)

45-103: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicate seeding helpers across split-seam test files.

seedOrgProjectEnv/seedRun here are re-implemented nearly verbatim in apps/webapp/test/idempotencyKeyConcernLegacyAuthority.test.ts and apps/webapp/test/resetIdempotencyKeyLegacyAuthority.test.ts. Consider extracting a shared heteroPostgresTest fixture-seeding helper module as this residency test suite keeps growing.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d490b21c-676d-49ba-83a2-67771105b181

📥 Commits

Reviewing files that changed from the base of the PR and between 413a945 and 515b897.

📒 Files selected for processing (50)
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/concerns/idempotencyResidency.server.test.ts
  • apps/webapp/app/runEngine/concerns/idempotencyResidency.server.ts
  • apps/webapp/app/runEngine/concerns/resolveWaitpointThroughReadThrough.server.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/runEngine/services/triggerFailedTask.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.test.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/services/archiveBranch.server.ts
  • apps/webapp/app/services/dashboardAgent.server.ts
  • apps/webapp/app/services/deleteProject.server.ts
  • apps/webapp/app/services/realtime/runReader.server.ts
  • apps/webapp/app/services/realtime/sessions.server.ts
  • apps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.ts
  • apps/webapp/app/services/runsRepository/runsRepository.server.ts
  • apps/webapp/app/v3/services/alerts/deliverAlert.server.ts
  • apps/webapp/app/v3/services/alerts/performTaskRunAlerts.server.ts
  • apps/webapp/app/v3/services/batchTriggerV3.server.ts
  • apps/webapp/app/v3/services/bulk/BulkActionV2.batchReadThrough.server.test.ts
  • apps/webapp/app/v3/services/bulk/BulkActionV2.batchReadThrough.server.ts
  • apps/webapp/app/v3/services/bulk/BulkActionV2.server.ts
  • apps/webapp/app/v3/services/cancelDevSessionRuns.server.ts
  • apps/webapp/app/v3/services/createCheckpoint.server.ts
  • apps/webapp/app/v3/services/createTaskRunAttempt.server.ts
  • apps/webapp/app/v3/services/enqueueDelayedRun.server.ts
  • apps/webapp/app/v3/services/executeTasksWaitingForDeploy.ts
  • apps/webapp/app/v3/services/expireEnqueuedRun.server.ts
  • apps/webapp/app/v3/services/finalizeTaskRun.server.ts
  • apps/webapp/app/v3/services/resumeBatchRun.server.ts
  • apps/webapp/test/batchTriggerV3ResidencyInheritance.test.ts
  • apps/webapp/test/batchTriggerV3StoreRouting.test.ts
  • apps/webapp/test/bulkActionV2ReadRouting.test.ts
  • apps/webapp/test/cancelDevSessionRunsStoreRouting.test.ts
  • apps/webapp/test/engine/streamBatchItems.test.ts
  • apps/webapp/test/engine/triggerFailedTask.test.ts
  • apps/webapp/test/engine/triggerTask.test.ts
  • apps/webapp/test/idempotencyDedupResidency.test.ts
  • apps/webapp/test/idempotencyKeyConcernLegacyAuthority.test.ts
  • apps/webapp/test/performTaskRunAlertsStoreRouting.test.ts
  • apps/webapp/test/realtime/runReaderReadThrough.test.ts
  • apps/webapp/test/realtime/streamRegistrationRouting.test.ts
  • apps/webapp/test/resetIdempotencyKeyLegacyAuthority.test.ts
  • apps/webapp/test/resolveWaitpointThroughReadThrough.readthrough.test.ts
  • apps/webapp/test/runEngineBatchTriggerStoreRouting.test.ts
  • apps/webapp/test/runsRepository.readthrough.test.ts
  • apps/webapp/test/runsRepositoryCpres.test.ts
  • apps/webapp/test/sessions.readthrough.test.ts
  • apps/webapp/test/streamLoader.controlPlane.test.ts

Comment thread apps/webapp/app/runEngine/services/batchTrigger.server.ts Outdated
Comment thread apps/webapp/app/runEngine/services/streamBatchItems.server.ts
Comment thread apps/webapp/app/v3/services/alerts/deliverAlert.server.ts
Comment thread apps/webapp/test/cancelDevSessionRunsStoreRouting.test.ts
Comment thread apps/webapp/test/performTaskRunAlertsStoreRouting.test.ts
d-cs commented Jul 2, 2026

Collaborator Author

Addressed the outside-diff note on idempotencyKeys.server.ts (blockRunWithWaitpoint writing via this.prisma): the block now passes the residency-resolved dedupClient as tx, so the idempotent parent run's waitpoint write lands on the store that owns that parent run rather than the fallback client.

d-cs force-pushed the runops/pr05-webapp-foundation branch from 26871d5 to cdc4eb9 on July 2, 2026 19:25
d-cs force-pushed the runops/pr06-write-path branch from c59d9c5 to d5d7fa1 on July 2, 2026 19:25
d-cs force-pushed the runops/pr05-webapp-foundation branch from cdc4eb9 to e0b35d5 on July 2, 2026 20:21
d-cs force-pushed the runops/pr06-write-path branch 3 times, most recently from 0db90f0 to d5415e8 on July 2, 2026 21:44
d-cs force-pushed the runops/pr05-webapp-foundation branch from 8024e36 to f9b9b0b on July 3, 2026 08:51
d-cs force-pushed the runops/pr06-write-path branch from aa55b6b to 3153bc4 on July 3, 2026 08:51
d-cs force-pushed the runops/pr05-webapp-foundation branch from f9b9b0b to 0937b15 on July 3, 2026 10:02
d-cs force-pushed the runops/pr06-write-path branch from 3153bc4 to d561590 on July 3, 2026 10:02
d-cs force-pushed the runops/pr05-webapp-foundation branch from 0937b15 to 729daf1 on July 3, 2026 10:36
d-cs force-pushed the runops/pr06-write-path branch from d561590 to 9e7c367 on July 3, 2026 10:36
d-cs force-pushed the runops/pr05-webapp-foundation branch from 729daf1 to bd6fc79 on July 3, 2026 10:44
d-cs force-pushed the runops/pr06-write-path branch from 9e7c367 to e23432d on July 3, 2026 10:44
d-cs force-pushed the runops/pr05-webapp-foundation branch from bd6fc79 to a7e0846 on July 3, 2026 11:08
d-cs force-pushed the runops/pr06-write-path branch from e23432d to 8dff8b2 on July 3, 2026 11:08
d-cs force-pushed the runops/pr05-webapp-foundation branch from a7e0846 to 4119616 on July 3, 2026 12:08
d-cs force-pushed the runops/pr06-write-path branch 2 times, most recently from 891d81a to 5140cbc on July 3, 2026 15:42
d-cs force-pushed the runops/pr05-webapp-foundation branch from d087c25 to b554794 on July 3, 2026 16:33
d-cs and others added 27 commits July 3, 2026 17:43
…eration labels

Add a pure unit test for ControlPlaneCache covering per-slot round-trips,
null-vs-miss distinction, epoch-based invalidation, per-slot key isolation,
bounded eviction, and TTL expiry. Add a testcontainer test for
probeDistinctDatabases covering distinct clusters, same physical database
(with reason), same-cluster-different-database, and fail-closed probe failure.

Strip developer-enumeration labels from three existing test files (readThrough
step numbers, runEngineHandlers Test-X comments) and rename the run-detail
loader read-through test to drop the non-domain "shape 1" name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… deps

apps/webapp/package.json declares @internal/run-ops-database (workspace) and
@testcontainers/postgresql but the lockfile importer entry was never regenerated,
so pnpm install --frozen-lockfile fails for the webapp. Regenerate the importer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enabling RUN_OPS_SPLIT_ENABLED without REALTIME_BACKEND_NATIVE_ENABLED
silently breaks realtime: Electric replicates only from the control-plane
DB, so NEW-resident (ksuid) runs on the dedicated run-ops DB are invisible
and every realtime subscription hangs.

Add a boot-time interlock that refuses split mode in that misconfiguration,
mirroring the existing distinct-DB data-loss sentinel. The check is a pure
predicate (assertSplitRealtimeInterlock) run synchronously inside
assertRunOpsSplitSentinel on the same eager-boot path, failing fast before
the async DB probe and before any run-ops routing is wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
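
The interlock described in the commit message above could be sketched as a pure predicate like the following. The function name comes from the commit message, but the `SplitFlags` shape, the parameter style, and the error text are assumptions made for illustration.

```typescript
// Hypothetical flag shape; the real code presumably reads these from env config.
interface SplitFlags {
  runOpsSplitEnabled: boolean; // mirrors RUN_OPS_SPLIT_ENABLED
  realtimeNativeEnabled: boolean; // mirrors REALTIME_BACKEND_NATIVE_ENABLED
}

// Pure and synchronous: no I/O, so it can run before any async DB probe.
function assertSplitRealtimeInterlock(flags: SplitFlags): void {
  if (flags.runOpsSplitEnabled && !flags.realtimeNativeEnabled) {
    throw new Error(
      "RUN_OPS_SPLIT_ENABLED requires REALTIME_BACKEND_NATIVE_ENABLED: " +
        "runs on the dedicated run-ops DB would be invisible to realtime subscriptions"
    );
  }
}
```

Keeping the check free of I/O is what lets it fail fast on the eager-boot path, before any run-ops routing is wired.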
…n diagnostics

- gate runOpsTopology splitEnabled on RUN_OPS_SPLIT_ENABLED so provisioning
  both DSNs before flipping the flag cannot open a second pool or route writes
  ahead of the distinct-DB sentinel
- rethrow the original UnclassifiableRunId in the cross-seam guard so its
  value/valueLength keep reflecting the real waitpoint id
- log run-found-but-environment-unresolved distinctly from missing-run
- correct the RUN_OPS_DATABASE_URL doc comment (Prisma datasource, not the
  webapp runtime pool)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uth-env through the cache-first resolver

The ControlPlaneCache served env/org data with no invalidation, so admin/control-plane
writes were only reflected after the TTL. Add two invalidation scopes to the cache
(invalidateEnvironment for one env's slots; invalidateOrganization via a per-org epoch that
env/authEnv values are stamped with, so all of an org's cached rows drop with no reverse
index), expose them on the resolver, and call them at every write site that mutates
cache-served data: pause/resume, archive, env/org concurrency + burst-factor, API-key
regeneration, feature flags, API/batch rate limits, runs enable/disable, org + project
delete, and stream-basin provisioning.

Also extend the resolver's authenticated-env slot to carry `git` and make the run-engine
adapter's resolveAuthenticatedEnv delegate to the cache-first, split-aware resolver instead
of issuing its own $replica.findFirst, so it honors splitEnabled() and the cache like its
siblings while still returning `git` and the deleted-project guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
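
A minimal sketch of the per-org epoch idea from the commit message above: each cached value is stamped with its organization's epoch at write time, and bumping the epoch makes every older entry stale without a reverse index. The class name `EpochCache`, its method signatures, and the injectable clock are hypothetical; the real ControlPlaneCache may be structured differently.

```typescript
// Hypothetical sketch of epoch-stamped cache entries.
interface Entry<V> {
  value: V;
  orgId: string;
  epoch: number;
  expiresAt: number;
}

class EpochCache<V> {
  private entries = new Map<string, Entry<V>>();
  private orgEpochs = new Map<string, number>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  private epochOf(orgId: string): number {
    return this.orgEpochs.get(orgId) ?? 0;
  }

  set(key: string, orgId: string, value: V): void {
    this.entries.set(key, {
      value,
      orgId,
      epoch: this.epochOf(orgId),
      expiresAt: this.now() + this.ttlMs,
    });
  }

  // Returns undefined on a miss; a cached null comes back as null.
  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    const stale = entry.expiresAt <= this.now() || entry.epoch !== this.epochOf(entry.orgId);
    if (stale) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Drops a single slot, comparable to invalidating one environment.
  invalidateKey(key: string): void {
    this.entries.delete(key);
  }

  // Bumps the epoch so every entry stamped with the old one is stale on read.
  invalidateOrganization(orgId: string): void {
    this.orgEpochs.set(orgId, this.epochOf(orgId) + 1);
  }
}
```

The trade-off in this sketch is that stale entries linger in memory until they are read or evicted, in exchange for an O(1) organization-wide invalidation.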
… OFF

With the split OFF there is a single DB, so a run and its environment are
co-located and there is no cross-seam FK/check to replace (matches main).
Skip the always-on hot-path read in that branch; the split-ON branch is
unchanged (cache-first, throws on a genuinely missing env).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… routing, run lifecycle

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e in TriggerFailedTaskService

TriggerFailedTaskService read the parent run via the ambient module-singleton
store while the engine wrote the run through its own store, so a ksuid parent's
row was not found and parentTaskRunId came back null. Add an optional injected
runStore (defaults to the shared singleton, preserving production behaviour) and
resolve the parent through it at both call sites, mirroring triggerTask.server.ts.

Align the three affected webapp tests to read through the same store the engine
wrote to: triggerFailedTask.test.ts passes engine.runStore; performTaskRunAlerts
routing passes a passthrough store over the seeded container; triggerTask.test.ts
stubs the run-ops db handles and pins split mode off so the idempotency dedup uses
the container client.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…id-shape only

Migration is deferred, so child/batch residency is a pure id-shape check.
Remove the isKnownMigrated (and mint-only isSplitEnabled) deps from the mint
sites (triggerTask, triggerFailedTask, batchTriggerV3) and call the now-
synchronous resolveInheritedMintKind(parentFriendlyId) with no deps arg.

Read paths: drop the isKnownMigrated re-probe-avoidance from the ClickHouse
runs hydrate (probe all missing on legacy), the runsRepository readThrough
options type, resolveWaitpointThroughReadThrough deps, and the BulkActionV2
batch seam adapter — keeping the genuine cross-seam fallback that reads NEW
first for unclassifiable/legacy-candidate ids.

Delete the injected-marker test cases; the remaining residency tests assert
pure id-shape inheritance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s and test names

Review hygiene only: remove the NEW-1 label, Test X: name prefixes, and
[TEST-NEWSEED] comment label. No product logic or test behavior changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o safe run-ops clients

The read-through concern defaulted both newClient and legacyReplica to $replica
(control-plane), so a bare caller that omits `deps` — the waitpoints wait route —
never queried the dedicated run-ops replica. A co-located, NEW-resident waitpoint
minted by streams.input().wait() lives on the run-ops-new DB, so the read missed,
returned null, and the route 404'd (re-serialized to 500).

Match the deps the complete/callback routes pass: default newClient to
runOpsNewReplica, legacyReplica to $replica, and splitEnabled to
runOpsSplitReadEnabled — mirroring readThroughRun's own self-defaulting. This
immunizes any bare caller (present or future) against the control-plane pin,
without touching the wait route. The wait/complete/callback call sites live on a
higher branch and are unchanged; complete/callback keep their explicit deps
(now redundant but harmless).

Adds a heteroRunOps regression case driving the concern with no `deps` via the
`defaults` DI seam: proves the old $replica default misses a NEW-resident
waitpoint (null) while the safe run-ops default finds it. No mocks; the fallback
is exercised against real PG14/PG17 containers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
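
The self-defaulting behaviour described in the commit message above could be sketched as follows. All names here (`ReadClient`, `Deps`, `makeResolver`) are hypothetical, the lookups are kept synchronous for brevity where the real clients are async, and reading only the legacy replica when the split is off is an assumption of this sketch.

```typescript
interface WaitpointRow {
  id: string;
}

// Hypothetical read handle; stands in for a Prisma replica client.
interface ReadClient {
  findWaitpoint(id: string): WaitpointRow | null;
}

interface Deps {
  newClient: ReadClient;
  legacyReplica: ReadClient;
  splitEnabled: () => boolean;
}

// The defaults are injected once; callers may override any subset via `deps`.
function makeResolver(defaults: Deps) {
  return function resolveWaitpoint(id: string, deps: Partial<Deps> = {}): WaitpointRow | null {
    const { newClient, legacyReplica, splitEnabled } = { ...defaults, ...deps };
    // Assumption: with the split off there is a single DB to consult.
    if (!splitEnabled()) return legacyReplica.findWaitpoint(id);
    // Probe the NEW store first, then fall back to the legacy replica.
    return newClient.findWaitpoint(id) ?? legacyReplica.findWaitpoint(id);
  };
}
```

With defaults that point `newClient` at the run-ops replica, a caller that omits `deps` still reaches a NEW-resident waitpoint, which is the failure the commit describes fixing.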
…rvice prisma to the resolved store

- Block the idempotent parent run's waitpoint via the residency-resolved
  dedup client instead of the fallback prisma, so the write lands on the
  store that owns the parent run.
- Pass the caller-provided _prisma into WithRunEngine so a custom store
  isn't silently overridden by the module singleton.
- Throw when a run-backed alert's environment can't be resolved instead of
  marking it SENT, so a transient replica miss doesn't permanently suppress
  the alert.
- Pin splitEnabled:false in the waitpoint passthrough test so it exercises
  single-DB behaviour rather than relying on ksuid residency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The write-path split added static `runOpsLegacyPrisma`/`runOpsNewPrisma`
imports to idempotencyKeys.server.ts, which this test loads. vitest
validates every named import against the `~/db.server` mock, so the mock
now errored on the missing run-ops singletons. Add the four run-ops
exports (empty stubs, same boundary pattern as the batchTriggerV3
residency test) and pin isSplitEnabled() to false so the dedup routing
deterministically returns the injected fake prisma regardless of the
ambient RUN_OPS_SPLIT_ENABLED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…setup

Worker/engine/marqs/pubsub/socket singletons each construct an ioredis
client at import time (singleton() + no lazyConnect), so any test importing
the service graph opened real Redis connections on import. In CI there is no
Redis, so these accumulate infinite-retry clients across a shard and take
the suite down (locally they pass only because dev Redis is up).

Globally mock the eager-Redis modules to no-op stubs in test/setup.ts:
commonWorker, batchTriggerWorker, legacyRunEngineWorker, alertsWorker,
the RunEngine and MarQS singletons, devPubSub and the socket.io server.
Only these singletons are mocked — never the run store (~/v3/runStore.server,
~/db.server), which store-routing/residency tests need real against
testcontainer Postgres.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…yConnect + stub runtime Redis singletons

The setup-file mocks of the six eager worker/engine singletons were not
enough: CI shards still flooded ECONNREFUSED/maxRetries. Two further
classes of env-Redis usage survived them, reproduced locally by running
the failing shards with REDIS_PORT pointed at a dead port:

1. Import-time construction: ~15 more singletons (platform cache,
   billing-limit reconcile queue, alerts rate limiter, DevPresence,
   auto-increment counter, s2 token cache, v1 streams cache, ...)
   build ioredis clients at module import, and ioredis dials on
   construction. A global ioredis mock now forces lazyConnect: true so
   clients only dial on first command — testcontainer-backed tests are
   unaffected (their first command connects as before).

2. Runtime commands inside code under test: tracePubSub.publish()
   (eventRepository writes), alertsRateLimiter.check() (deliverAlert)
   and the task metadata cache each issue commands against
   env-configured Redis mid-test; every command burns ~20 reconnect
   cycles before its error surfaces, which times the tests out. These
   three modules are now stubbed (metadata cache pinned to its Noop
   implementation, which is what CI's unset env resolves to anyway).

Verified: webapp shards 2/5/6/8 (the ones failing on the pr06+ stack)
run green with Redis pointed at a dead port, and shards 2/8 stay green
against live Redis (store-routing suites still exercise the real run
store).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in CI

CI runners have no .env, no REDIS_HOST/REDIS_PORT, and no Postgres at
localhost:5432, which surfaced two failure layers that local runs mask
(the dev stack answers on both):

- suites transitively importing triggerTaskV1.server failed to collect
  because autoIncrementCounter.server.ts throws at import when
  REDIS_HOST/REDIS_PORT are unset (shards 2/5/6). Default the pair in
  test/setup.ts — the global ioredis lazyConnect mock means nothing dials.
- TriggerFailedTaskService.call() resolved its event repository via
  getEventRepository → global prisma (feature-flag read + Prisma event
  repo), so in CI the swallowed connect error returned null friendlyIds
  (shard 8). Allow injecting the repository/store pair and bind the test
  to an EventRepository over the testcontainer DB.
- once the cancelDevSessionRuns suite could collect, findLatestSession's
  hardwired global $replica was the next masked layer; give it an
  injectable client (defaulting to $replica) and pass the service's
  _replica through.

Verified by replaying the exact CI env locally (.env hidden, workflow env
vars, dead localhost DB, GITHUB_ACTIONS set): all four failing suites and
full shards 2/5/6/8 reproduce the CI failures before and pass after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…method access

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…am regressions

Two regression tests for the write-path read seams:
- runsRepository: paginating the full keyset over interleaved cuid/ksuid runs
  enumerates every id once, no empty page, in ClickHouse (created_at DESC, run_id
  DESC) order -- fails if hydration reverts to lexical id desc across the id-space
  seam.
- runReader: a NEW-resident (ksuid) run's terminal metadata hydrates through the
  owning store, never a generic legacy replica.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
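
To illustrate the ordering that the runsRepository regression test above guards, here is a small in-memory sketch of keyset pagination over (created_at DESC, run_id DESC). The `RunRow` type, the `page` function, and the shortened ids are hypothetical; the real repository pages through ClickHouse rather than an array.

```typescript
interface RunRow {
  runId: string;
  createdAt: number;
}

// Orders by createdAt DESC, then runId DESC as the tiebreaker.
function compareDesc(a: RunRow, b: RunRow): number {
  if (a.createdAt !== b.createdAt) return b.createdAt - a.createdAt;
  if (a.runId === b.runId) return 0;
  return a.runId < b.runId ? 1 : -1;
}

// Returns the rows strictly after `cursor` in sort order, plus the next cursor.
// `next` is null when nothing remains, so a caller never fetches an empty page.
function page(
  rows: RunRow[],
  cursor: RunRow | null,
  size: number
): { items: RunRow[]; next: RunRow | null } {
  const sorted = [...rows].sort(compareDesc);
  const remaining =
    cursor === null ? sorted : sorted.filter((r) => compareDesc(cursor, r) < 0);
  const items = remaining.slice(0, size);
  const last = items[items.length - 1];
  const next = remaining.length > size && last ? last : null;
  return { items, next };
}
```

Sorting by run id alone would group all `c`-prefixed ids ahead of the digit-prefixed ones regardless of creation time, which is the regression the test is described as catching.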
d-cs force-pushed the runops/pr05-webapp-foundation branch from b554794 to 071cdc1 on July 3, 2026 16:44
d-cs force-pushed the runops/pr06-write-path branch from f8f3096 to 4bda37a on July 3, 2026 16:44
Base automatically changed from runops/pr05-webapp-foundation to main July 3, 2026 17:02