
AIR CLI Integration: air run end to end command #5710

Open
riddhibhagwat-db wants to merge 5 commits into air-integration-m2-2 from air-integration-m2-3


riddhibhagwat-db commented Jun 24, 2026


Changes

Implements the air run happy path on top of the config schema (#5657), submitting a one-time training run through the Jobs API. Five commits, one per phase:

  1. run config launch accessors: flatten the validated config into launch values (timeout seconds, retry default, requirements file-vs-inline, runtime version).
  2. wire run command (load, validate, dry-run): air run -f loads and structurally validates the YAML; --dry-run validates offline (no workspace/auth) and returns; --override/--watch are rejected for now with clear errors (to be ported in a future PR).
  3. pre-submit resolution: resolve current user / workspace home / a unique cli_launch dir, and ensure a custom experiment_directory exists.
  4. upload launch artifacts: write training_config.yaml (1 MB cap), command.sh, requirements.yaml (file or synthesized from inline deps), env_vars.json / secret_env_vars.json, and hyperparameters.yaml into the launch dir via a workspace filer.
  5. assemble + submit: build the native ai_runtime_task payload and POST /api/2.2/jobs/runs/submit directly, then print the run id + dashboard URL (or a JSON envelope).

Submission uses the native ai_runtime_task task type (BYOT). It talks only to the Jobs API, which routes internally to the training service endpoint, with no genai-mapi forwarding (the MAPI path is deprecated). The typed Go SDK does not model ai_runtime_task, so the payload is a custom struct posted to the raw endpoint. The proto is lean: env vars and secrets ship as co-located env_vars.json / secret_env_vars.json files rather than inline, and requirements.yaml / hyperparameters.yaml are derived server-side from the command directory.
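
As a rough illustration of "a custom struct posted to the raw endpoint", here is a minimal sketch using only net/http against an in-process fake server. The top-level fields (run_name, idempotency_token, tasks, task_key, run_id) follow the public Jobs runs/submit shape; everything inside ai_runtime_task (command_directory here) is a hypothetical placeholder, since the real proto fields are not shown in this PR. Mapping --idempotency-key to idempotency_token is also an assumption, and the actual CLI presumably goes through its authenticated API client rather than a bare http.Client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// aiRuntimeTask is a HYPOTHETICAL stand-in; the real proto fields are not shown in the PR.
type aiRuntimeTask struct {
	CommandDirectory string `json:"command_directory"`
}

type submitTask struct {
	TaskKey       string         `json:"task_key"`
	AIRuntimeTask *aiRuntimeTask `json:"ai_runtime_task"`
}

type submitRequest struct {
	RunName          string       `json:"run_name"`
	IdempotencyToken string       `json:"idempotency_token,omitempty"`
	Tasks            []submitTask `json:"tasks"`
}

type submitResponse struct {
	RunID int64 `json:"run_id"`
}

// submit marshals the hand-built payload and POSTs it to the raw Jobs endpoint.
func submit(client *http.Client, host, token string, req submitRequest) (int64, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return 0, err
	}
	httpReq, err := http.NewRequest(http.MethodPost, host+"/api/2.2/jobs/runs/submit", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	httpReq.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(httpReq)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return 0, fmt.Errorf("submit failed: %s: %s", resp.Status, msg)
	}
	var out submitResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return out.RunID, nil
}

// newFakeJobsServer accepts a well-formed submit request and always answers run_id 42.
func newFakeJobsServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/api/2.2/jobs/runs/submit" {
			http.NotFound(w, r)
			return
		}
		var req submitRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || len(req.Tasks) != 1 || req.Tasks[0].AIRuntimeTask == nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"run_id": 42}`)
	}))
}

func main() {
	srv := newFakeJobsServer()
	defer srv.Close()
	req := submitRequest{
		RunName:          "air-cuj",
		IdempotencyToken: "demo-key-1",
		Tasks: []submitTask{{
			TaskKey:       "train",
			AIRuntimeTask: &aiRuntimeTask{CommandDirectory: "/hypothetical/cli_launch/dir"},
		}},
	}
	runID, err := submit(srv.Client(), srv.URL, "fake-token", req)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("run_id=%d dashboard_url=%s/jobs/runs/%d\n", runID, srv.URL, runID)
}
```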

Deferred, with explicit "not yet supported" errors (no silent drops): code_source snapshot packaging, --watch log streaming, and usage_policy_name. environment.docker_image is accepted by the schema as scaffolding but not conveyed in the payload (the native path has no docker field). node_pool_id / pool_name / priority remain dropped (the new AIR CLI does not support pool placement).

Why

air run is the core of the migration for AIR CLI. Splitting it into per-phase commits keeps each reviewable in isolation, and stacking on the schema PR keeps that PR focused. Regarding some specific decisions:

  • We keep the native ai_runtime_task (rather than the gen_ai_compute_task that interfaces with MAPI) as a hand-built struct posted to the raw endpoint. This lets us talk to Jobs directly: jobs.SubmitTask only knows gen_ai_compute_task, and that typed struct also omits the env-vars/secrets/requirements fields the run needs. It also keeps us off the deprecated genai-mapi forwarding path.
  • --dry-run is decoupled from auth. It validates the config locally and returns before any workspace call, so config validation works fully offline (matching the Python CLI). Only actual submission requires an authenticated workspace client.
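
The dry-run decoupling described in the second bullet can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: run, runOptions, submitter, and errNoAuth are hypothetical names, and the client is built lazily through a factory so the dry-run path never touches auth. The flag guards are placed before the dry-run return because the manual checks show --dry-run --watch still producing the --watch error; whether they precede validation is a guess.

```go
package main

import (
	"errors"
	"fmt"
)

// runOptions mirrors the flags discussed in the PR; the struct itself is hypothetical.
type runOptions struct {
	DryRun   bool
	Watch    bool
	Override []string
}

// submitter is a hypothetical stand-in for an authenticated workspace client.
type submitter interface {
	Submit(configPath string) (runID string, err error)
}

// errNoAuth simulates a workspace client that cannot be configured.
var errNoAuth = errors.New("cannot configure default credentials")

// run validates first and only asks for a client when it actually has to submit.
func run(configPath string, opts runOptions, validate func(string) error, newClient func() (submitter, error)) (string, error) {
	if opts.Watch {
		return "", errors.New("--watch is not yet supported")
	}
	if len(opts.Override) > 0 {
		return "", errors.New("--override is not yet supported")
	}
	if err := validate(configPath); err != nil {
		return "", err
	}
	if opts.DryRun {
		return "validated", nil // no workspace call has happened yet
	}
	client, err := newClient()
	if err != nil {
		return "", err
	}
	return client.Submit(configPath)
}

func main() {
	valid := func(string) error { return nil }
	noAuth := func() (submitter, error) { return nil, errNoAuth }

	status, err := run("/tmp/min.yaml", runOptions{DryRun: true}, valid, noAuth)
	fmt.Println(status, err) // validated <nil>

	_, err = run("/tmp/min.yaml", runOptions{}, valid, noAuth)
	fmt.Println(err) // cannot configure default credentials
}
```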

Tests

  • Unit tests for every phase: launch accessors, pre-submit resolution (incl. ensureExperimentDirectory create/exists/not-a-directory), artifact assembly + upload, payload assembly, and submitWorkload end-to-end against a fake workspace.
  • New acceptance/experimental/air/run test covering --dry-run (text + JSON), the --override/--watch guards, an invalid config, and missing --file.
  • Updated the unimplemented acceptance test (removed run, now implemented).
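
For readers unfamiliar with the three ensureExperimentDirectory cases named in the unit-test bullet (create / exists / not-a-directory), here is a minimal sketch against an in-memory fake. Only the function name comes from the PR; the workspace interface, its Stat/Mkdirs methods, and errNotADirectory are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// workspace is a hypothetical narrow interface over the workspace file API.
type workspace interface {
	// Stat reports whether path exists and, if so, whether it is a directory.
	Stat(path string) (exists, isDir bool, err error)
	Mkdirs(path string) error
}

var errNotADirectory = errors.New("path exists but is not a directory")

// fakeWorkspace maps path -> isDir and stands in for a real workspace in tests.
type fakeWorkspace struct {
	entries map[string]bool
}

func (f *fakeWorkspace) Stat(path string) (bool, bool, error) {
	isDir, ok := f.entries[path]
	return ok, isDir, nil
}

func (f *fakeWorkspace) Mkdirs(path string) error {
	f.entries[path] = true
	return nil
}

// ensureExperimentDirectory creates the directory if missing, accepts an
// existing directory, and rejects an existing non-directory object.
func ensureExperimentDirectory(ws workspace, path string) error {
	exists, isDir, err := ws.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	switch {
	case !exists:
		return ws.Mkdirs(path)
	case !isDir:
		return fmt.Errorf("%w: %s", errNotADirectory, path)
	default:
		return nil
	}
}

func main() {
	ws := &fakeWorkspace{entries: map[string]bool{
		"/Users/me/experiments": true,  // existing directory
		"/Users/me/notes.txt":   false, // existing file
	}}
	fmt.Println(ensureExperimentDirectory(ws, "/Users/me/new-experiments")) // <nil>
	fmt.Println(ensureExperimentDirectory(ws, "/Users/me/experiments"))     // <nil>
	err := ensureExperimentDirectory(ws, "/Users/me/notes.txt")
	fmt.Println(errors.Is(err, errNotADirectory)) // true
}
```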

go test ./experimental/air/..., go test ./acceptance -run TestAccept/experimental/air, and ./task lint-q all pass.

Manual verification tests (all pass):

  • Dry run (offline, no auth)
  • command only
  • full run config
  • json output
  • actual run submission
  • throws error when profile is not set
  • submission loop: after submitting, the run is visible in air list and air get, and the MLflow experiment was created
  • the same run id is output when a run is submitted with the SAME idempotency key
  • a new run is created when a run is submitted with the SAME config but a DIFFERENT idempotency key
  • --watch and --override return an informative error message (since they are not supported yet, but are valid flags)
  • usage_policy_name set in config throws error: usage_policy_name is not yet supported
  • code_source set in config throws error: code_source is not yet supported
  • missing --file throws informative error: required flag(s) "file" not set
  • invalid config (e.g. experiment_name: bad.name, or num_accelerators not a multiple of the per-node count) throws field-specific validation error
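
The two field-specific checks in the last bullet can be illustrated with a small sketch. The error strings are copied from the manual-verification output in this description; the function names, the regular expression, and the per-node table (inferred from the GPU_1xH100 / GPU_8xH100 names) are assumptions rather than the PR's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// experimentNameRE allows only alphanumerics, hyphens, and underscores.
var experimentNameRE = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// acceleratorsPerNode is an ASSUMED mapping inferred from the accelerator type names.
var acceleratorsPerNode = map[string]int{
	"GPU_1xH100": 1,
	"GPU_8xH100": 8,
}

func validateExperimentName(name string) error {
	if !experimentNameRE.MatchString(name) {
		return fmt.Errorf("invalid experiment_name %q: only alphanumeric characters, hyphens (-), and underscores (_) are allowed", name)
	}
	return nil
}

func validateAccelerators(acceleratorType string, n int) error {
	perNode, ok := acceleratorsPerNode[acceleratorType]
	if !ok {
		return fmt.Errorf("unknown compute.accelerator_type %q", acceleratorType)
	}
	if n <= 0 || n%perNode != 0 {
		return fmt.Errorf("compute.num_accelerators for %s must be a multiple of %d, got %d", acceleratorType, perNode, n)
	}
	return nil
}

func main() {
	fmt.Println(validateExperimentName("bad.name"))
	fmt.Println(validateAccelerators("GPU_8xH100", 3))
	fmt.Println(validateAccelerators("GPU_8xH100", 16)) // <nil>
}
```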

How to test locally for manual verification:

Checkout & build:

git fetch origin
git checkout air-integration-m2-3        # this PR (stacked on air-integration-m2-2)
./task build

Sample configs:

cat > /tmp/min.yaml <<'YAML'
experiment_name: air-cuj
command: python train.py
compute: {accelerator_type: GPU_1xH100, num_accelerators: 1}
YAML
cat > /tmp/full.yaml <<'YAML'
experiment_name: full-run
command: |
  pip install -r requirements.txt
  python train.py
compute: {accelerator_type: GPU_8xH100, num_accelerators: 16}
environment: {dependencies: [torch==2.3.0], version: 5}
env_variables: {WANDB_PROJECT: demo}
secrets: {HF_TOKEN: my_scope/hf_token}
parameters: {lr: 0.001, epochs: 3}
mlflow_run_name: full-run-v2
max_retries: 2
timeout_minutes: 120
YAML

Automated tests

go test ./experimental/air/...                      # unit (incl. submitWorkload vs a fake workspace)
go test ./acceptance -run TestAccept/experimental/air   # acceptance (run + unimplemented)
./task lint-q                                        # lint changed files

Dry run:

./cli experimental air run -f /tmp/min.yaml --dry-run
# note: in the final version this command will be `databricks experimental air run`
./cli experimental air run -f /tmp/full.yaml --dry-run
./cli experimental air run -f /tmp/min.yaml --dry-run -o json

Actual run submission:

PROFILE=<your-dev-profile>

# no auth configured → fails fast (exit 1)
env -u DATABRICKS_HOST -u DATABRICKS_TOKEN ./cli experimental air run -f /tmp/min.yaml
#> Error: ... (cannot configure default credentials / auth)

# submit → prints run_id + dashboard URL
./cli experimental air run -f /tmp/min.yaml -p $PROFILE -o json
#> { "data": { "status":"SUBMITTED", "run_id":"<id>", "dashboard_url":"<host>/jobs/runs/<id>" } }

# verify in the workspace: open dashboard_url (run exists), and the MLflow experiment was created.
./cli experimental air get <run_id> -p $PROFILE         # run state
./cli experimental air list -p $PROFILE                 # run appears in the list

# idempotency — SAME key returns the SAME run_id (no new run)
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-1 -o json   # run_id = X
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-1 -o json   # run_id = X (same)

# idempotency — DIFFERENT key creates a NEW run
./cli experimental air run -f /tmp/min.yaml -p $PROFILE --idempotency-key demo-key-2 -o json   # run_id = Y (new)

Unsupported flags (asserting that error is thrown):

./cli experimental air run -f /tmp/min.yaml --dry-run --watch
#> Error: --watch is not yet supported
./cli experimental air run -f /tmp/min.yaml --dry-run --override compute.num_accelerators=8
#> Error: --override is not yet supported

# usage_policy_name (needs a workspace to reach the submit guard)
printf 'experiment_name: t\ncommand: x\ncompute: {accelerator_type: GPU_1xH100, num_accelerators: 1}\nusage_policy_name: my-policy\n' > /tmp/policy.yaml
./cli experimental air run -f /tmp/policy.yaml -p $PROFILE
#> Error: usage_policy_name is not yet supported

# code_source
printf 'experiment_name: t\ncommand: x\ncompute: {accelerator_type: GPU_1xH100, num_accelerators: 1}\ncode_source: {type: snapshot, snapshot: {root_path: .}}\n' > /tmp/code.yaml
./cli experimental air run -f /tmp/code.yaml -p $PROFILE
#> Error: code_source is not yet supported

Field-specific validation errors (exit 1, offline):

# missing --file
./cli experimental air run --dry-run
#> Error: required flag(s) "file" not set

# invalid experiment_name + num_accelerators not a multiple of the per-node count
printf 'experiment_name: bad.name\ncommand: x\ncompute: {accelerator_type: GPU_8xH100, num_accelerators: 3}\n' > /tmp/bad.yaml
./cli experimental air run -f /tmp/bad.yaml --dry-run
#> Error: invalid experiment_name "bad.name": only alphanumeric characters, hyphens (-), and underscores (_) are allowed
#  (and, once the name is fixed: compute.num_accelerators for GPU_8xH100 must be a multiple of 8, got 3)


github-actions Bot commented Jun 24, 2026

Contributor

Approval status: pending

/acceptance/experimental/air/ - needs approval

8 files changed
Eligible: @apeforest, @bfontain, @lu-wang-dl, @panchalhp-db, @vinchenzo-db, @maggiewang-db, @ben-hansen-db, @pardis-beikzadeh-db

/experimental/air/ - needs approval

10 files changed
Eligible: @apeforest, @bfontain, @lu-wang-dl, @panchalhp-db, @vinchenzo-db, @maggiewang-db, @ben-hansen-db, @pardis-beikzadeh-db

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

riddhibhagwat-db changed the title from "AIR Integration:" to "AIR Integration: air run end to end command" on Jun 24, 2026

eng-dev-ecosystem-bot commented Jun 24, 2026

Collaborator

Integration test report

Commit: a5d851b

Run: 28137799905

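| Env | ❌ FAIL | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 💚 aws linux | | | 8 | 13 | 263 | 1016 | 5:40 |
| ❌ aws windows | 1 | 1 | 8 | 13 | 263 | 1014 | 15:06 |
| 💚 aws-ucws linux | | | 8 | 13 | 359 | 930 | 6:18 |
| 💚 aws-ucws windows | | | 8 | 13 | 361 | 928 | 7:13 |
| 💚 azure linux | | | 2 | 15 | 266 | 1014 | 4:57 |
| 💚 azure windows | | | 2 | 15 | 268 | 1012 | 6:33 |
| 💚 azure-ucws linux | | | 2 | 15 | 364 | 926 | 6:19 |
| 💚 azure-ucws windows | | | 2 | 15 | 366 | 924 | 6:55 |
| 💚 gcp linux | | | 2 | 15 | 262 | 1017 | 5:12 |
| 💚 gcp windows | | | 2 | 15 | 264 | 1015 | 6:57 |
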
23 interesting tests: 13 SKIP, 8 RECOVERED, 1 FAIL, 1 flaky
| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 💚 TestAccept | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 💚 R | 💚 R | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 💚 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 💚 R | 💚 R | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| ❌ TestSecretsPutSecretBytesValue | ✅ p | ❌ F | 🙈 s | 🙈 s | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 🔄 TestSecretsPutSecretStringValue | ✅ p | 🔄 f | 🙈 s | 🙈 s | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 💚 TestFetchRepositoryInfoAPI_FromRepo | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |

Top 21 slowest tests (at least 2 minutes):
duration env testname
8:20 aws windows TestSecretsPutSecretStringValue
5:06 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:17 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:16 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:06 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:30 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:12 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:12 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:11 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:06 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:47 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:40 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:40 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:39 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:37 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:32 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:32 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:32 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:18 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

riddhibhagwat-db self-assigned this Jun 24, 2026
riddhibhagwat-db changed the title from "AIR Integration: air run end to end command" to "AIR CLI Integration: air run end to end command" on Jun 24, 2026

Flatten the validated runConfig schema into the derived values the launch
path consumes (timeout seconds, retry default, docker image URL, requirements
file vs inline dependencies, runtime version), replacing the Python CLI's
_convert_to_run_config step. handle_run reads runConfig directly, so these are
accessors rather than a separate internal config type.
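
To make "accessors rather than a separate internal config type" concrete, a minimal sketch of the idea follows. The struct layout, the requirements_file-style field, and the default retry value are assumptions for illustration; only the derived values (timeout seconds, retry default, file vs inline requirements) come from the description above.

```go
package main

import "fmt"

// runConfig is a HYPOTHETICAL, trimmed-down stand-in for the validated schema.
// Pointer fields distinguish "unset" from an explicit zero.
type runConfig struct {
	TimeoutMinutes   *int
	MaxRetries       *int
	RequirementsFile string   // assumed field: path to a requirements file
	Dependencies     []string // inline dependencies
}

// defaultMaxRetries is an assumed default; the real value is not stated in the PR.
const defaultMaxRetries = 0

// timeoutSeconds converts timeout_minutes into seconds; 0 means no timeout was set.
func (c runConfig) timeoutSeconds() int {
	if c.TimeoutMinutes == nil {
		return 0
	}
	return *c.TimeoutMinutes * 60
}

// maxRetries returns the configured retry count or the default when unset.
func (c runConfig) maxRetries() int {
	if c.MaxRetries == nil {
		return defaultMaxRetries
	}
	return *c.MaxRetries
}

// requirementsSource reports where dependencies come from: "file", "inline", or "none".
func (c runConfig) requirementsSource() string {
	switch {
	case c.RequirementsFile != "":
		return "file"
	case len(c.Dependencies) > 0:
		return "inline"
	default:
		return "none"
	}
}

func main() {
	timeout, retries := 120, 2
	cfg := runConfig{
		TimeoutMinutes: &timeout,
		MaxRetries:     &retries,
		Dependencies:   []string{"torch==2.3.0"},
	}
	fmt.Println(cfg.timeoutSeconds(), cfg.maxRetries(), cfg.requirementsSource()) // 7200 2 inline
}
```

Because the accessors hang off the schema type itself, the launch path can read derived values without a conversion step in between.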

Co-authored-by: Isaac

Wire `air run`'s RunE to load and structurally validate the YAML config, and
implement --dry-run (validate without submitting). The non-dry-run submission
path returns "not implemented" until the submit phase lands; --override is
rejected with a clear error since the override pipeline is not ported yet.

Drop `run` from the not-implemented stub test now that it does real work.

Co-authored-by: Isaac

Resolve the workspace context air run needs before uploading and submitting:
the current user, the per-user workspace home (with env override), a unique
cli_launch directory for a run's artifacts, the MLflow experiment path, and
ensuring a custom experiment_directory exists (created if missing, matching the
CLI's convention for its other artifact directories).

Co-authored-by: Isaac

Assemble and upload the launch artifacts for a run into its cli_launch
directory: the merged config (training_config.yaml, 1 MB cap), the inline
command as command.sh, requirements.yaml (from a file or synthesized from
inline dependencies), and hyperparameters.yaml. buildArtifacts is pure; the
upload writes through a narrow fileWriter (a workspace filer in production).
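
The split between a pure buildArtifacts and a narrow fileWriter might look roughly like the sketch below. The names buildArtifacts and fileWriter come from this commit message; launchInput, memWriter, the requirements.yaml layout, and the exact cap value (1 MB taken here as 1 << 20 bytes) are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// maxConfigBytes is the assumed byte value of the "1 MB cap" on training_config.yaml.
const maxConfigBytes = 1 << 20

// fileWriter is the narrow upload seam; production would back it with a workspace filer.
type fileWriter interface {
	Write(path string, content []byte) error
}

// memWriter is an in-memory fileWriter for tests.
type memWriter map[string][]byte

func (m memWriter) Write(path string, content []byte) error {
	m[path] = content
	return nil
}

// launchInput is a hypothetical bundle of the values the artifacts are built from.
type launchInput struct {
	ConfigYAML   []byte
	Command      string
	Dependencies []string
}

// buildArtifacts is pure: it maps input to file name -> content with no I/O.
func buildArtifacts(in launchInput) (map[string][]byte, error) {
	if len(in.ConfigYAML) > maxConfigBytes {
		return nil, fmt.Errorf("training_config.yaml is %d bytes, over the %d byte cap", len(in.ConfigYAML), maxConfigBytes)
	}
	out := map[string][]byte{
		"training_config.yaml": in.ConfigYAML,
		"command.sh":           []byte(in.Command),
	}
	if len(in.Dependencies) > 0 {
		// The requirements.yaml layout below is an assumption, not the real format.
		var b strings.Builder
		b.WriteString("dependencies:\n")
		for _, dep := range in.Dependencies {
			fmt.Fprintf(&b, "  - %s\n", dep)
		}
		out["requirements.yaml"] = []byte(b.String())
	}
	return out, nil
}

// upload writes every artifact under dir in a deterministic (sorted) order.
func upload(w fileWriter, dir string, artifacts map[string][]byte) error {
	names := make([]string, 0, len(artifacts))
	for name := range artifacts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if err := w.Write(dir+"/"+name, artifacts[name]); err != nil {
			return fmt.Errorf("upload %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	artifacts, err := buildArtifacts(launchInput{
		ConfigYAML:   []byte("experiment_name: air-cuj\n"),
		Command:      "python train.py",
		Dependencies: []string{"torch==2.3.0"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	w := memWriter{}
	if err := upload(w, "/hypothetical/cli_launch/dir", artifacts); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(w), "files uploaded")
}
```

Keeping the builder free of I/O means the size cap and file contents can be unit-tested without any workspace, while the writer interface is the only piece that needs a fake.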

A TODO(DABs) marks the client-side upload path as a future candidate for
reuse of DABs' file-staging (libs/sync / bundle deploy).

Co-authored-by: Isaac

Wire `air run` end to end: ensure the experiment directory, upload launch
artifacts, build the native ai_runtime_task payload, and submit it via a direct
POST to /api/2.2/jobs/runs/submit. The ai_runtime_task routes straight to the
training service with no genai-mapi forwarding — the MAPI path is deprecated.

The proto is lean: env vars and secrets are staged as co-located env_vars.json /
secret_env_vars.json workspace files rather than inline, and requirements /
hyperparameters are derived server-side from the command directory. The
non-dry-run path resolves the workspace context, uploads, submits, and prints
the run id + dashboard URL. usage_policy_name, code_source snapshots, and
--watch are rejected with clear errors until their phases land.
environment.docker_image is accepted by the schema as scaffolding but not
conveyed (the native path has no docker field).

Co-authored-by: Isaac
