AIR CLI Integration: `air run` end to end command by riddhibhagwat-db · Pull Request #5710 · databricks/cli

riddhibhagwat-db · 2026-06-24T17:52:57Z

Changes

Implements the air run happy path on top of the config schema (#5657), submitting a one-time training run through the Jobs API. Five commits, one per phase:

run config launch accessors: flatten the validated config into launch values (timeout seconds, retry default, requirements file-vs-inline, runtime version).
wire run command (load, validate, dry-run): air run -f loads and structurally validates the YAML; --dry-run validates offline (no workspace/auth) and returns; --override and --watch are rejected for now with clear errors (to be ported in a future PR).
pre-submit resolution: resolve current user / workspace home / a unique cli_launch dir, and ensure a custom experiment_directory exists.
upload launch artifacts: write training_config.yaml (1 MB cap), command.sh, requirements.yaml (file or synthesized from inline deps), env_vars.json / secret_env_vars.json, and hyperparameters.yaml into the launch dir via a workspace filer.
assemble + submit: build the native ai_runtime_task payload and POST /api/2.2/jobs/runs/submit directly, then print the run id + dashboard URL (or a JSON envelope).

Submission uses the native ai_runtime_task task (the BYOT task type). It talks only to the Jobs API, which internally routes to the training service endpoint, and involves no genai-mapi forwarding (the MAPI path is deprecated). The task isn't modeled by the typed Go SDK, so the payload is a custom struct posted to the raw endpoint. The proto is lean: env vars and secrets ship as co-located env_vars.json / secret_env_vars.json files rather than inline, and requirements.yaml / hyperparameters.yaml are derived server-side from the command directory.
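A minimal sketch of what a hand-built payload for the raw endpoint could look like. The top-level fields (`run_name`, `idempotency_token`, `timeout_seconds`, `tasks`, `task_key`) follow the public Jobs `runs/submit` request body; every field inside `ai_runtime_task`, and all type and function names, are hypothetical placeholders, since the real proto is not shown in this description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// aiRuntimeTask is a hypothetical stand-in for the native task body.
// Its field names are assumptions for illustration, not the real proto.
type aiRuntimeTask struct {
	CommandDirectory string `json:"command_directory"`
	AcceleratorType  string `json:"accelerator_type"`
	NumAccelerators  int    `json:"num_accelerators"`
}

// submitTask carries the single task of the one-time run.
type submitTask struct {
	TaskKey       string         `json:"task_key"`
	AIRuntimeTask *aiRuntimeTask `json:"ai_runtime_task"`
}

// submitRequest mirrors the top level of a runs/submit request body.
type submitRequest struct {
	RunName          string       `json:"run_name"`
	IdempotencyToken string       `json:"idempotency_token,omitempty"`
	TimeoutSeconds   int          `json:"timeout_seconds,omitempty"`
	Tasks            []submitTask `json:"tasks"`
}

// buildSubmitRequest assembles the payload from already-resolved launch values.
func buildSubmitRequest(name, launchDir, accel string, n, timeoutSec int, key string) submitRequest {
	return submitRequest{
		RunName:          name,
		IdempotencyToken: key,
		TimeoutSeconds:   timeoutSec,
		Tasks: []submitTask{{
			TaskKey: "train", // illustrative task key
			AIRuntimeTask: &aiRuntimeTask{
				CommandDirectory: launchDir,
				AcceleratorType:  accel,
				NumAccelerators:  n,
			},
		}},
	}
}

func main() {
	req := buildSubmitRequest("air-cuj", "/launch/dir", "GPU_1xH100", 1, 7200, "demo-key-1")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	// This JSON is what would be POSTed to /api/2.2/jobs/runs/submit.
	fmt.Println(string(body))
}
```

Because env vars, secrets, requirements, and hyperparameters travel as files in the launch directory, a struct like this only needs to point at that directory rather than embed their contents.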

Deferred, with explicit "not yet supported" errors (no silent drops): code_source snapshot packaging, --watch log streaming, and usage_policy_name. environment.docker_image is accepted by the schema as scaffolding but not conveyed in the payload (the native path has no docker field). node_pool_id / pool_name / priority remain dropped (the new AIR CLI does not support pool placement).
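The "no silent drops" rule can be sketched as two small guards: one for flags, which the commands further down exercise even under --dry-run, and one for config fields checked on the submit path. The struct, the function names, and the order of the checks are assumptions for illustration; only the error strings come from this description.

```go
package main

import (
	"errors"
	"fmt"
)

// runConfig holds only the fields relevant to the guards (hypothetical shape).
type runConfig struct {
	UsagePolicyName string
	HasCodeSource   bool
}

// checkFlags rejects flags that are valid but not yet ported.
func checkFlags(watch bool, overrides []string) error {
	if watch {
		return errors.New("--watch is not yet supported")
	}
	if len(overrides) > 0 {
		return errors.New("--override is not yet supported")
	}
	return nil
}

// checkConfig rejects config fields that would otherwise be silently dropped.
func checkConfig(cfg runConfig) error {
	if cfg.HasCodeSource {
		return errors.New("code_source is not yet supported")
	}
	if cfg.UsagePolicyName != "" {
		return errors.New("usage_policy_name is not yet supported")
	}
	return nil
}

func main() {
	fmt.Println(checkFlags(true, nil))
	fmt.Println(checkFlags(false, []string{"compute.num_accelerators=8"}))
	fmt.Println(checkConfig(runConfig{UsagePolicyName: "my-policy"}))
	fmt.Println(checkConfig(runConfig{HasCodeSource: true}))
}
```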

Why

air run is the core of the migration for AIR CLI. Splitting it into per-phase commits keeps each reviewable in isolation, and stacking on the schema PR keeps that PR focused. Regarding some specific decisions:

We keep the native ai_runtime_task (rather than the gen_ai_compute_task that interfaces with MAPI) as a hand-built struct posted to the raw endpoint. This lets us talk to Jobs directly and stay off the deprecated genai-mapi forwarding path: jobs.SubmitTask only knows gen_ai_compute_task, and that typed struct also omits the env-vars/secrets/requirements fields needed for the run.
--dry-run is decoupled from auth. It validates the config locally and returns before any workspace call, so config validation works fully offline (matching the Python CLI). Only actual submission requires an authenticated workspace client.
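A minimal sketch of the dry-run decoupling described above: validation runs first, and the function that builds an authenticated client is only invoked on the submission path. All names here (`config`, `submitter`, `run`, `connect`) and the placeholder validation checks are hypothetical, not the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a trimmed, hypothetical view of the run config.
type config struct {
	ExperimentName string
	Command        string
}

// submitter is the only capability that needs an authenticated workspace.
type submitter interface {
	Submit(cfg config) (runID string, err error)
}

// validate stands in for the structural validation (placeholder checks).
func validate(cfg config) error {
	if cfg.ExperimentName == "" {
		return errors.New("experiment_name is required")
	}
	if cfg.Command == "" {
		return errors.New("command is required")
	}
	return nil
}

// run validates locally, returns early on dry-run, and only then connects.
func run(cfg config, dryRun bool, connect func() (submitter, error)) (string, error) {
	if err := validate(cfg); err != nil {
		return "", err
	}
	if dryRun {
		return "config is valid (dry run, nothing submitted)", nil
	}
	s, err := connect()
	if err != nil {
		return "", fmt.Errorf("submission requires authentication: %w", err)
	}
	return s.Submit(cfg)
}

func main() {
	noAuth := func() (submitter, error) {
		return nil, errors.New("no credentials configured")
	}
	cfg := config{ExperimentName: "air-cuj", Command: "python train.py"}

	msg, err := run(cfg, true, noAuth) // succeeds offline
	fmt.Println(msg, err)

	_, err = run(cfg, false, noAuth) // fails fast without auth
	fmt.Println(err)
}
```

Passing the client constructor as a function keeps the offline path testable without any workspace at all.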

Tests

Unit tests for every phase: launch accessors, pre-submit resolution (incl. ensureExperimentDirectory create/exists/not-a-directory), artifact assembly + upload, payload assembly, and submitWorkload end-to-end against a fake workspace.
New acceptance/experimental/air/run test covering --dry-run (text + JSON), the --override/--watch guards, an invalid config, and missing --file.
Updated the unimplemented acceptance test (removed run, now implemented).

go test ./experimental/air/..., go test ./acceptance -run TestAccept/experimental/air, and ./task lint-q all pass.

Manual verification tests (all pass):

Dry run (offline, no auth)

command only

full run config

json output

actual run submission

throws error when profile is not set

submission loop: the run is submitted, appears in air list and air get, and the MLflow experiment was created

the same run id is output when the run is submitted with the SAME idempotency key

a new run is created when the run is submitted with the SAME config but a DIFFERENT idempotency key

--watch and --override return an informative error message (since they are not supported yet, but are valid flags)
usage_policy_name set in config throws error: usage_policy_name is not yet supported
code_source set in config throws error: code_source is not yet supported
missing --file throws informative error: required flag(s) "file" not set
invalid config (e.g. experiment_name: bad.name, or num_accelerators not a multiple of the per-node count) throws field-specific validation error

How to test locally for manual verification:

Checkout & build:

git fetch origin
git checkout air-integration-m2-3        # this PR (stacked on air-integration-m2-2)
./task build

Sample configs:

cat > /tmp/min.yaml <<'YAML'
experiment_name: air-cuj
command: python train.py
compute: {accelerator_type: GPU_1xH100, num_accelerators: 1}
YAML

cat > /tmp/full.yaml <<'YAML'
experiment_name: full-run
command: |
  pip install -r requirements.txt
  python train.py
compute: {accelerator_type: GPU_8xH100, num_accelerators: 16}
environment: {dependencies: [torch==2.3.0], version: 5}
env_variables: {WANDB_PROJECT: demo}
secrets: {HF_TOKEN: my_scope/hf_token}
parameters: {lr: 0.001, epochs: 3}
mlflow_run_name: full-run-v2
max_retries: 2
timeout_minutes: 120
YAML

Automated tests

go test ./experimental/air/...                      # unit (incl. submitWorkload vs a fake workspace)
go test ./acceptance -run TestAccept/experimental/air   # acceptance (run + unimplemented)
./task lint-q                                        # lint changed files

Dry run:

./cli experimental air run -f /tmp/min.yaml --dry-run
# note: in the final version this command will be `databricks experimental air run`
./cli experimental air run -f /tmp/full.yaml --dry-run
./cli experimental air run -f /tmp/min.yaml --dry-run -o json

Actual run submission:

PROFILE=<your-dev-profile>

# no auth configured → fails fast (exit 1)
env -u DATABRICKS_HOST -u DATABRICKS_TOKEN ./cli experimental air run -f /tmp/min.yaml
#> Error: ... (cannot configure default credentials / auth)

# submit → prints run_id + dashboard URL
./cli experimental air run -f /tmp/min.yaml -p $PROFILE -o json
#> { "data": { "status":"SUBMITTED", "run_id":"<id>", "dashboard_url":"<host>/jobs/runs/<id>" } }

# verify in the workspace: open dashboard_url (run exists), and the MLflow experiment was created.
./cli experimental air get <run_id> -p $PROFILE         # run state
./cli experimental air list -p $PROFILE                 # run appears in the list

# idempotency — SAME key returns the SAME run_id (no new run)
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-1 -o json   # run_id = X
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-1 -o json   # run_id = X (same)

# idempotency — DIFFERENT key creates a NEW run
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-2 -o json   # run_id = Y (new)

Unsupported flags and config fields (asserting that an error is returned):

./cli experimental air run -f /tmp/min.yaml --dry-run --watch
#> Error: --watch is not yet supported
./cli experimental air run -f /tmp/min.yaml --dry-run --override compute.num_accelerators=8
#> Error: --override is not yet supported

# usage_policy_name (needs a workspace to reach the submit guard)
printf 'experiment_name: t\ncommand: x\ncompute: {accelerator_type: GPU_1xH100, num_accelerators: 1}\nusage_policy_name: my-policy\n' > /tmp/policy.yaml
./cli experimental air run -f /tmp/policy.yaml -p $PROFILE
#> Error: usage_policy_name is not yet supported

# code_source
printf 'experiment_name: t\ncommand: x\ncompute: {accelerator_type: GPU_1xH100, num_accelerators: 1}\ncode_source: {type: snapshot, snapshot: {root_path: .}}\n' > /tmp/code.yaml
./cli experimental air run -f /tmp/code.yaml -p $PROFILE
#> Error: code_source is not yet supported

Field-specific validation errors (exit 1, offline):

# missing --file
./cli experimental air run --dry-run
#> Error: required flag(s) "file" not set

# invalid experiment_name + num_accelerators not a multiple of the per-node count
printf 'experiment_name: bad.name\ncommand: x\ncompute: {accelerator_type: GPU_8xH100, num_accelerators: 3}\n' > /tmp/bad.yaml
./cli experimental air run -f /tmp/bad.yaml --dry-run
#> Error: invalid experiment_name "bad.name": only alphanumeric characters, hyphens (-), and underscores (_) are allowed
#  (and, once the name is fixed: compute.num_accelerators for GPU_8xH100 must be a multiple of 8, got 3)
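The two field-specific checks exercised above can be sketched as follows. The error strings are taken from the expected output shown; the regular expression, the per-node lookup table (limited to the two accelerator types used in the sample configs), and all function names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// experimentNameRE allows alphanumerics, hyphens, and underscores (assumed pattern).
var experimentNameRE = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// acceleratorsPerNode is a hypothetical lookup covering only the sample types.
var acceleratorsPerNode = map[string]int{
	"GPU_1xH100": 1,
	"GPU_8xH100": 8,
}

func validateExperimentName(name string) error {
	if !experimentNameRE.MatchString(name) {
		return fmt.Errorf("invalid experiment_name %q: only alphanumeric characters, hyphens (-), and underscores (_) are allowed", name)
	}
	return nil
}

func validateNumAccelerators(accelType string, n int) error {
	perNode, ok := acceleratorsPerNode[accelType]
	if !ok {
		return fmt.Errorf("unknown compute.accelerator_type %q", accelType)
	}
	if n <= 0 || n%perNode != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", accelType, perNode, n)
	}
	return nil
}

// validate returns the first failing check, so the name error surfaces first.
func validate(name, accelType string, n int) error {
	if err := validateExperimentName(name); err != nil {
		return err
	}
	return validateNumAccelerators(accelType, n)
}

func main() {
	fmt.Println(validate("bad.name", "GPU_8xH100", 3))
	fmt.Println(validate("good-name", "GPU_8xH100", 3))
	fmt.Println(validate("good-name", "GPU_8xH100", 16))
}
```

Returning on the first failure matches the behavior noted above, where the num_accelerators error only appears once the name is fixed.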

github-actions · 2026-06-24T17:53:36Z

Approval status: pending

`/acceptance/experimental/air/` - needs approval

8 files changed
Eligible: @apeforest, @bfontain, @lu-wang-dl, @panchalhp-db, @vinchenzo-db, @maggiewang-db, @ben-hansen-db, @pardis-beikzadeh-db

`/experimental/air/` - needs approval

10 files changed
Eligible: @apeforest, @bfontain, @lu-wang-dl, @panchalhp-db, @vinchenzo-db, @maggiewang-db, @ben-hansen-db, @pardis-beikzadeh-db

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas. See OWNERS for ownership rules.

eng-dev-ecosystem-bot · 2026-06-24T18:44:41Z

Integration test report

Commit: a5d851b

Run: 28137799905

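| | Env | ❌FAIL | 🔄flaky | 💚RECOVERED | 🙈SKIP | ✅pass | 🙈skip | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 💚 | aws linux | | | 8 | 13 | 263 | 1016 | 5:40 |
| ❌ | aws windows | 1 | 1 | 8 | 13 | 263 | 1014 | 15:06 |
| 💚 | aws-ucws linux | | | 8 | 13 | 359 | 930 | 6:18 |
| 💚 | aws-ucws windows | | | 8 | 13 | 361 | 928 | 7:13 |
| 💚 | azure linux | | | 2 | 15 | 266 | 1014 | 4:57 |
| 💚 | azure windows | | | 2 | 15 | 268 | 1012 | 6:33 |
| 💚 | azure-ucws linux | | | 2 | 15 | 364 | 926 | 6:19 |
| 💚 | azure-ucws windows | | | 2 | 15 | 366 | 924 | 6:55 |
| 💚 | gcp linux | | | 2 | 15 | 262 | 1017 | 5:12 |
| 💚 | gcp windows | | | 2 | 15 | 264 | 1015 | 6:57 |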

23 interesting tests: 13 SKIP, 8 RECOVERED, 1 FAIL, 1 flaky

| | Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 💚 | TestAccept | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |
| 🙈 | TestAccept/bundle/invariant/no_drift | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚R | 💚R | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 💚 | TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚R | 💚R | 💚R | 💚R | | | | | | |
| 🙈 | TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 | TestAccept/ssh/connection | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| ❌ | TestSecretsPutSecretBytesValue | ✅p | ❌F | 🙈s | 🙈s | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p |
| 🔄 | TestSecretsPutSecretStringValue | ✅p | 🔄f | 🙈s | 🙈s | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p |
| 💚 | TestFetchRepositoryInfoAPI_FromRepo | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |

Top 21 slowest tests (at least 2 minutes):

| duration | env | testname |
| --- | --- | --- |
| 8:20 | aws windows | TestSecretsPutSecretStringValue |
| 5:06 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:17 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:16 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:06 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:30 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:12 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:12 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:11 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:06 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:47 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:47 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:47 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:40 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:40 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:39 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:37 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:32 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:32 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:32 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:18 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |

Flatten the validated runConfig schema into the derived values the launch path consumes (timeout seconds, retry default, docker image URL, requirements file vs inline dependencies, runtime version), replacing the Python CLI's _convert_to_run_config step. handle_run reads runConfig directly, so these are accessors rather than a separate internal config type. Co-authored-by: Isaac
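A minimal sketch of the accessor approach this commit describes: derived values are computed by methods on the validated config rather than copied into a second internal type. The struct fields, method names, and the default retry count are all assumptions; the commit message does not state the real default.

```go
package main

import "fmt"

// defaultMaxRetries is an assumed default, not the value used by the CLI.
const defaultMaxRetries = 0

// runConfig is a trimmed, hypothetical view of the validated schema.
// Pointer fields distinguish "unset" from an explicit zero.
type runConfig struct {
	TimeoutMinutes   *int
	MaxRetries       *int
	RequirementsFile string
	Dependencies     []string
}

// timeoutSeconds converts timeout_minutes to seconds; 0 means no timeout set.
func (c runConfig) timeoutSeconds() int {
	if c.TimeoutMinutes == nil {
		return 0
	}
	return *c.TimeoutMinutes * 60
}

// maxRetries returns the configured retries or the default when unset.
func (c runConfig) maxRetries() int {
	if c.MaxRetries == nil {
		return defaultMaxRetries
	}
	return *c.MaxRetries
}

// requirementsSource reports whether requirements come from a file,
// from inline dependencies, or are absent.
func (c runConfig) requirementsSource() string {
	switch {
	case c.RequirementsFile != "":
		return "file"
	case len(c.Dependencies) > 0:
		return "inline"
	default:
		return "none"
	}
}

func main() {
	timeout, retries := 120, 2
	cfg := runConfig{
		TimeoutMinutes: &timeout,
		MaxRetries:     &retries,
		Dependencies:   []string{"torch==2.3.0"},
	}
	fmt.Println(cfg.timeoutSeconds(), cfg.maxRetries(), cfg.requirementsSource())
}
```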

Wire `air run`'s RunE to load and structurally validate the YAML config, and implement --dry-run (validate without submitting). The non-dry-run submission path returns "not implemented" until the submit phase lands; --override is rejected with a clear error since the override pipeline is not ported yet. Drop `run` from the not-implemented stub test now that it does real work. Co-authored-by: Isaac

Resolve the workspace context air run needs before uploading and submitting: the current user, the per-user workspace home (with env override), a unique cli_launch directory for a run's artifacts, the MLflow experiment path, and ensuring a custom experiment_directory exists (created if missing, matching the CLI's convention for its other artifact directories). Co-authored-by: Isaac
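The three outcomes tested for ensureExperimentDirectory (create, exists, not-a-directory) can be illustrated with a sketch like the one below. The `workspace` interface, the in-memory fake, and the function name `ensureDirectory` are hypothetical; the real code talks to the workspace API.

```go
package main

import (
	"errors"
	"fmt"
)

// workspace is a hypothetical, narrow view of the workspace API.
type workspace interface {
	// Stat reports whether path exists and, if so, whether it is a directory.
	Stat(path string) (exists, isDir bool, err error)
	Mkdirs(path string) error
}

var errNotADirectory = errors.New("path exists but is not a directory")

// ensureDirectory creates path if missing, accepts an existing directory,
// and fails if the path is occupied by a non-directory object.
func ensureDirectory(ws workspace, path string) error {
	exists, isDir, err := ws.Stat(path)
	if err != nil {
		return err
	}
	if !exists {
		return ws.Mkdirs(path)
	}
	if !isDir {
		return fmt.Errorf("%s: %w", path, errNotADirectory)
	}
	return nil
}

// fakeWorkspace maps a path to whether it is a directory.
type fakeWorkspace map[string]bool

func (f fakeWorkspace) Stat(path string) (bool, bool, error) {
	isDir, ok := f[path]
	return ok, isDir, nil
}

func (f fakeWorkspace) Mkdirs(path string) error {
	f[path] = true
	return nil
}

func main() {
	ws := fakeWorkspace{"/exp/file": false}
	fmt.Println(ensureDirectory(ws, "/exp/new"))  // created
	fmt.Println(ensureDirectory(ws, "/exp/new"))  // already exists
	fmt.Println(ensureDirectory(ws, "/exp/file")) // not a directory
}
```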

Assemble and upload the launch artifacts for a run into its cli_launch directory: the merged config (training_config.yaml, 1 MB cap), the inline command as command.sh, requirements.yaml (from a file or synthesized from inline dependencies), and hyperparameters.yaml. buildArtifacts is pure; the upload writes through a narrow fileWriter (a workspace filer in production). A TODO(DABs) marks the client-side upload path as a future candidate for reuse of DABs' file-staging (libs/sync / bundle deploy). Co-authored-by: Isaac
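A minimal sketch of the split this commit describes: a pure function that builds the artifacts in memory, and an upload step that writes through a narrow interface. The interface method, the in-memory writer, the interpretation of "1 MB" as 1<<20 bytes, and the reduced artifact set (only two of the files) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// maxConfigBytes is the assumed reading of the "1 MB cap".
const maxConfigBytes = 1 << 20

// fileWriter is a narrow write-only view; a workspace filer would satisfy
// something like it in production (hypothetical signature).
type fileWriter interface {
	Write(filePath string, data []byte) error
}

// buildArtifacts is pure: it only turns inputs into named file contents.
func buildArtifacts(configYAML []byte, command string) (map[string][]byte, error) {
	if len(configYAML) > maxConfigBytes {
		return nil, fmt.Errorf("training_config.yaml is %d bytes, exceeds the %d byte cap", len(configYAML), maxConfigBytes)
	}
	return map[string][]byte{
		"training_config.yaml": configYAML,
		"command.sh":           []byte(command + "\n"),
	}, nil
}

// upload writes every artifact under dir in a deterministic order.
func upload(w fileWriter, dir string, files map[string][]byte) error {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if err := w.Write(path.Join(dir, name), files[name]); err != nil {
			return fmt.Errorf("upload %s: %w", name, err)
		}
	}
	return nil
}

// memWriter is an in-memory fileWriter for tests.
type memWriter map[string][]byte

func (m memWriter) Write(filePath string, data []byte) error {
	m[filePath] = data
	return nil
}

func main() {
	files, err := buildArtifacts([]byte("experiment_name: air-cuj\n"), "python train.py")
	if err != nil {
		panic(err)
	}
	w := memWriter{}
	if err := upload(w, "/launch/dir", files); err != nil {
		panic(err)
	}
	fmt.Println(len(w), string(w["/launch/dir/command.sh"]))
}
```

Keeping the build step free of I/O means the size cap and file contents can be unit-tested without a fake workspace, while the upload only needs a writer stub.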

Wire `air run` end to end: ensure the experiment directory, upload launch artifacts, build the native ai_runtime_task payload, and submit it via a direct POST to /api/2.2/jobs/runs/submit. The ai_runtime_task routes straight to the training service with no genai-mapi forwarding — the MAPI path is deprecated. The proto is lean: env vars and secrets are staged as co-located env_vars.json / secret_env_vars.json workspace files rather than inline, and requirements / hyperparameters are derived server-side from the command directory. The non-dry-run path resolves the workspace context, uploads, submits, and prints the run id + dashboard URL. usage_policy_name, code_source snapshots, and --watch are rejected with clear errors until their phases land. environment.docker_image is accepted by the schema as scaffolding but not conveyed (the native path has no docker field). Co-authored-by: Isaac

riddhibhagwat-db requested a review from maggiewang-db June 24, 2026 17:53

riddhibhagwat-db temporarily deployed to test-trigger-is June 24, 2026 17:53 — with GitHub Actions Inactive

riddhibhagwat-db changed the title ~~AIR Integration:~~ AIR Integration: air run end to end command Jun 24, 2026

riddhibhagwat-db self-assigned this Jun 24, 2026

riddhibhagwat-db changed the title ~~AIR Integration: air run end to end command~~ AIR CLI Integration: air run end to end command Jun 24, 2026

riddhibhagwat-db force-pushed the air-integration-m2-2 branch from 373988d to 226d41a Compare June 25, 2026 00:01

riddhibhagwat-db added 5 commits June 25, 2026 00:02

riddhibhagwat-db force-pushed the air-integration-m2-3 branch from c6edcc2 to a5d851b Compare June 25, 2026 00:06

riddhibhagwat-db temporarily deployed to test-trigger-is June 25, 2026 00:06 — with GitHub Actions Inactive

riddhibhagwat-db wants to merge 5 commits into air-integration-m2-2 from air-integration-m2-3
