Skip to content

feat(engine): Batch trigger reloaded#2779

Merged
ericallam merged 35 commits into
mainfrom
feat/batch-trigger-v2
Dec 16, 2025
Merged

feat(engine): Batch trigger reloaded#2779
ericallam merged 35 commits into
mainfrom
feat/batch-trigger-v2

Conversation

@ericallam

@ericallam ericallam commented Dec 11, 2025

Copy link
Copy Markdown
Member

New batch trigger system with larger payloads, streaming ingestion, larger batch sizes, and a fair processing system.

This PR introduces a new FairQueue abstraction inspired by our own RunQueue that enables multi-tenant fair queueing with concurrency limits. The new BatchQueue is built on top of the FairQueue, and handles processing Batch triggers in a fair manner with per-environment concurrency limits defined per-org. Additionally, there is a global concurrency limit to prevent the BatchQueue system from creating too many runs too quickly, which can cause downstream issues.

For this new BatchQueue system we have a completely new batch trigger creation and ingestion system. Previously this was a single endpoint with a single JSON body that defined details about the batch as well as all the items in the batch.

We're introducing a two-phase batch trigger ingestion system. In the first phase, the BatchTaskRun record is created (and possibly rate limited). The second phase is another endpoint that accepts an NDJSON body with each line being a single item/run with payload and options.

At ingestion time all items are added to a queue, in order, and then processed by the BatchQueue system.

New batch trigger rate limits

This PR implements a new batch trigger specific rate limit, configured on the Organization.batchRateLimitConfig column, and defaults using these environment variables:

  • BATCH_RATE_LIMIT_REFILL_RATE defaults to 10
  • BATCH_RATE_LIMIT_REFILL_INTERVAL the duration interval, defaults to "10s"
  • BATCH_RATE_LIMIT_MAX defaults to 1200

This rate limiter is scoped to the environment ID and controls how many runs can be submitted via batch triggers per interval. The SDK handles the retrying side.

Batch queue concurrency limits

The new column Organization.batchQueueConcurrencyConfig now defines an org specific processingConcurrency value, with a backup of the env var BATCH_CONCURRENCY_LIMIT_DEFAULT which defaults to 10. This controls how many batch queue items are processed concurrently per environment.

There is also a global rate limit for the batch queue set via the BATCH_QUEUE_GLOBAL_RATE_LIMIT which defaults to being disabled. If set, the entire batch queue system won't process more than BATCH_QUEUE_GLOBAL_RATE_LIMIT items per second. This allows controlling the maximum number of runs created per second via batch triggers.

Batch trigger settings

  • STREAMING_BATCH_MAX_ITEMS controls the maximum number of items in a single batch
  • STREAMING_BATCH_ITEM_MAXIMUM_SIZE controls the maximum size of each item in a batch
  • BATCH_CONCURRENCY_DEFAULT_CONCURRENCY controls the default environment concurrency
  • BATCH_QUEUE_DRR_QUANTUM how many credits each environment gets each round for the DRR scheduler
  • BATCH_QUEUE_MAX_DEFICIT the maximum deficit for the DRR scheduler
  • BATCH_QUEUE_CONSUMER_COUNT how many queue consumers to run
  • BATCH_QUEUE_CONSUMER_INTERVAL_MS how frequently they poll for items in the queue

Configuration Recommendations by Use Case

High-throughput priority (fairness acceptable at 0.98+):

BATCH_QUEUE_DRR_QUANTUM=25
BATCH_QUEUE_MAX_DEFICIT=100
BATCH_QUEUE_CONSUMER_COUNT=10
BATCH_QUEUE_CONSUMER_INTERVAL_MS=50
BATCH_CONCURRENCY_DEFAULT_CONCURRENCY=25

Strict fairness priority (throughput can be lower):

BATCH_QUEUE_DRR_QUANTUM=5
BATCH_QUEUE_MAX_DEFICIT=25
BATCH_QUEUE_CONSUMER_COUNT=3
BATCH_QUEUE_CONSUMER_INTERVAL_MS=100
BATCH_CONCURRENCY_DEFAULT_CONCURRENCY=5

Todo

  • Setup cloud processingConcurrency limits for orgs depending on pricing tier

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants