
bundle/fuzz: add create-payload parity fuzz test for terraform vs direct #5686

Open
radakam wants to merge 14 commits into main from deco-25361-fuzz-create-payload

Conversation

@radakam radakam commented Jun 23, 2026

Contributor

What

A test-only bundle/fuzz package that randomly generates job configs, deploys each under both the terraform and direct engines against an in-process testserver, and diffs the captured jobs/create payloads. Seeds are deterministic, so any divergence reproduces from the printed FUZZ_SEED=<n>.
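As a minimal sketch of what "seeds are deterministic" means here: every random choice is drawn from a generator seeded with the reported number, so re-running a seed rebuilds the same config. The `jobConfig` type and both function names below are hypothetical stand-ins, not the harness's real types.

```go
package main

import (
	"fmt"
	"math/rand"
)

// jobConfig is a hypothetical, heavily reduced stand-in for the real job resource.
type jobConfig struct {
	Name       string
	MaxRetries int
	SingleNode bool
}

// generateJobConfig derives every field from rng only, so the same seed
// always yields the same config.
func generateJobConfig(rng *rand.Rand) jobConfig {
	return jobConfig{
		Name:       fmt.Sprintf("fuzz-job-%d", rng.Intn(1000)),
		MaxRetries: rng.Intn(4),
		SingleNode: rng.Intn(2) == 0,
	}
}

// configForSeed builds a fresh generator per seed; no global random state is shared.
func configForSeed(seed int64) jobConfig {
	return generateJobConfig(rand.New(rand.NewSource(seed)))
}

func main() {
	a := configForSeed(29)
	b := configForSeed(29)
	fmt.Println(a == b) // prints: true
}
```

Using a per-seed `rand.Rand` rather than the package-level generator keeps seeds independent of each other and of test ordering.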

The suite runs nightly (task test-fuzz). It is opt-in via FUZZ_* environment variables and excluded from per-PR and merge-queue checks because each seed runs two real deploys. Failures surface as a single deduplicated GitHub issue. Only jobs are covered for now.

Why

We need confidence the direct engine produces the same API payloads as terraform during migration. This gives systematic, reproducible coverage and a nightly drift signal.

Divergences found

  1. num_workers on single-node task clusters (seed 29): terraform force-sends num_workers: 0 where direct omitted it. Fixed: prepareJobSettingsForUpdate now applies initializeNumWorkers to task new_cluster too, not just shared job_clusters. Seed 29 stays in regressionSeeds to guard against regression.
  2. spark.databricks.delta.preview.enabled: terraform strips this deprecated spark_conf key; direct forwards it. The backend ignores it either way, so it stays in defaultIgnorePaths (benign, no fix needed).

Testing

  • go test ./bundle/fuzz/... and ./bundle/config/mutator/resourcemutator/... — diff engine, generators, and the num_workers fix.
  • task test-fuzz — full terraform-vs-direct parity run (nightly).
  • Regenerated affected acceptance outputs, now showing num_workers: 0 on single-node task clusters under both engines.

Implements the first technique from DECO-25361: generate random job
configs and check for differences in the create payload between the
terraform and direct deploy engines.

Both engines run the same `bundle deploy` pipeline in-process (via
testcli) against a testserver, differing only in DATABRICKS_BUNDLE_ENGINE,
and the POST /api/2.2/jobs/create body each sends is captured and diffed.
Because only the engine differs, shared mutators cancel out and any
remaining diff is a genuine engine divergence.
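The capture-and-diff step described above can be sketched as a structural comparison of the two captured JSON bodies. This is an illustration under stated assumptions, not the harness's diff engine: `diffJSON` and `walk` are hypothetical names, arrays are compared as whole values, and paths use a simple dotted form.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// diffJSON returns the sorted paths at which two JSON documents differ.
// A key present on only one side counts as a difference, which is how an
// omitted field versus an explicit zero shows up.
func diffJSON(a, b []byte) ([]string, error) {
	var x, y any
	if err := json.Unmarshal(a, &x); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return nil, err
	}
	var out []string
	walk("", x, y, &out)
	sort.Strings(out)
	return out, nil
}

// walk recurses into objects and records the path of every mismatch.
func walk(path string, a, b any, out *[]string) {
	am, aok := a.(map[string]any)
	bm, bok := b.(map[string]any)
	if aok && bok {
		keys := map[string]struct{}{}
		for k := range am {
			keys[k] = struct{}{}
		}
		for k := range bm {
			keys[k] = struct{}{}
		}
		for k := range keys {
			av, ain := am[k]
			bv, bin := bm[k]
			child := k
			if path != "" {
				child = path + "." + k
			}
			if !ain || !bin {
				*out = append(*out, child)
				continue
			}
			walk(child, av, bv, out)
		}
		return
	}
	if !reflect.DeepEqual(a, b) {
		*out = append(*out, path)
	}
}

func main() {
	terraform := []byte(`{"name":"j","new_cluster":{"num_workers":0,"spark_version":"x"}}`)
	direct := []byte(`{"name":"j","new_cluster":{"spark_version":"x"}}`)
	diffs, err := diffJSON(terraform, direct)
	fmt.Println(diffs, err) // prints: [new_cluster.num_workers] <nil>
}
```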

The fuzzer already surfaced two real (benign) divergences, documented in
DefaultIgnorePaths:
  - num_workers: 0 is sent explicitly by terraform but dropped by direct
    (omitempty).
  - the terraform provider strips the deprecated spark conf
    "spark.databricks.delta.preview.enabled"; direct forwards it.

Run with: go test ./bundle/fuzz -run TestJobCreateParity
(FUZZ_SEEDS overrides the seed count; auto-skips when terraform is not
provisioned via acceptance/install_terraform.py).
@radakam radakam temporarily deployed to test-trigger-is June 23, 2026 07:44 — with GitHub Actions Inactive
…ignore

Address golangci-lint failures (intrange loops, strconv.FormatBool over
fmt.Sprintf) and tighten the create-payload ignore list: drop the dead
job_clusters num_workers entry (those are at parity) and document the
task-level num_workers divergence as a real CLI gap to fix separately.
@radakam radakam temporarily deployed to test-trigger-is June 23, 2026 08:06 — with GitHub Actions Inactive
eng-dev-ecosystem-bot commented Jun 23, 2026

Collaborator

Integration test report

Commit: bdddd24

Run: 28248869110

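| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 aws linux | 7 | | 1 | 13 | 235 | 1033 | 5:20 |
| 🟨 aws windows | 7 | | 1 | 13 | 237 | 1031 | 9:19 |
| 💚 aws-ucws linux | | | 8 | 13 | 322 | 950 | 4:54 |
| 🔄 aws-ucws windows | | 1 | 8 | 13 | 323 | 948 | 4:34 |
| 💚 azure linux | | | 2 | 15 | 235 | 1032 | 4:24 |
| 💚 azure windows | | | 2 | 15 | 237 | 1030 | 3:50 |
| 💚 azure-ucws linux | | | 2 | 15 | 324 | 947 | 5:16 |
| 💚 azure-ucws windows | | | 2 | 15 | 326 | 945 | 4:11 |
| 💚 gcp linux | | | 2 | 15 | 234 | 1034 | 3:52 |
| 💚 gcp windows | | | 2 | 15 | 236 | 1032 | 5:21 |
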
22 interesting tests: 13 SKIP, 7 KNOWN, 1 flaky, 1 RECOVERED

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 TestAccept | 🟨 K | 🟨 K | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🔄 TestSyncIncrementalFileOverwritesFolder | ✅ p | ✅ p | ✅ p | 🔄 f | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p | ✅ p |
| 💚 TestFetchRepositoryInfoAPI_FromRepo | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |

Top 4 slowest tests (at least 2 minutes):

| Duration | Env | Test name |
| --- | --- | --- |
| 4:08 | gcp windows | TestAccept |
| 2:42 | azure-ucws windows | TestAccept |
| 2:39 | azure windows | TestAccept |
| 2:35 | aws-ucws windows | TestAccept |

- Add a `test-fuzz` task and a nightly CI job that provisions terraform
  and runs the create-payload parity tests. They previously always
  skipped because terraform was never provisioned in the test path.
- Ignore repo-root build/ so the provisioned terraform binary and
  provider mirror are not accidentally committed.
- Skip cleanly when build/ is only partially provisioned (missing
  provider mirror or .terraformrc) instead of failing mid-deploy.
- Document that the harness covers jobs only for now (DECO-25361).
@radakam radakam temporarily deployed to test-trigger-is June 23, 2026 09:47 — with GitHub Actions Inactive
Make the create-payload parity fuzz suite explore new configs over time and
be reproducible from a reported seed:

- FUZZ_SEED (comma-separated) runs exactly those seeds, overriding the range,
  so a reported divergence reproduces with one command. The failure message
  now prints this knob.
- FUZZ_SEED_OFFSET shifts the deterministic window; push.yml derives it from
  GITHUB_RUN_NUMBER so each nightly run checks seeds it has never tested
  before instead of re-checking a fixed set. Windows are non-overlapping
  because the run number is unique and monotonic.
- Guard FUZZ_SEEDS > 0 so a negative value no longer panics make() and zero
  no longer passes as a no-op.
- Drop the test-fuzz Task sources fingerprint: the seeds depend on env vars
  Task can't see, so skipping on an unchanged checksum would silently no-op a
  repro run or a shifted window.
- Keep the nightly window modest (25); exploration comes from rotation, not
  size, and it can be raised once nightly timings are known.
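The seed-selection rules in this commit can be sketched roughly as follows. This is an assumption-laden illustration: `paritySeeds` here is a hypothetical reimplementation, the lookup function stands in for `os.Getenv`, and the exact window arithmetic (seeds starting at the offset itself) is a guess rather than the harness's confirmed behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// paritySeeds resolves the seeds to run from a lookup function
// (os.Getenv in a real run, a map in tests).
func paritySeeds(getenv func(string) string, defaultCount int) ([]int64, error) {
	// FUZZ_SEED lists exact seeds and overrides the range entirely.
	if raw := getenv("FUZZ_SEED"); raw != "" {
		var seeds []int64
		for _, part := range strings.Split(raw, ",") {
			n, err := strconv.ParseInt(strings.TrimSpace(part), 10, 64)
			if err != nil {
				return nil, fmt.Errorf("FUZZ_SEED: %w", err)
			}
			seeds = append(seeds, n)
		}
		return seeds, nil
	}

	count := defaultCount
	if raw := getenv("FUZZ_SEEDS"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil {
			return nil, fmt.Errorf("FUZZ_SEEDS: %w", err)
		}
		count = n
	}
	// Guard before make(): a negative length would panic, zero would be a no-op.
	if count <= 0 {
		return nil, fmt.Errorf("FUZZ_SEEDS must be positive, got %d", count)
	}

	var offset int64
	if raw := getenv("FUZZ_SEED_OFFSET"); raw != "" {
		n, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("FUZZ_SEED_OFFSET: %w", err)
		}
		offset = n
	}

	seeds := make([]int64, count)
	for i := range seeds {
		seeds[i] = offset + int64(i)
	}
	return seeds, nil
}

func main() {
	env := map[string]string{"FUZZ_SEEDS": "3", "FUZZ_SEED_OFFSET": "50"}
	seeds, err := paritySeeds(func(k string) string { return env[k] }, 25)
	fmt.Println(seeds, err) // prints: [50 51 52] <nil>
}
```

Taking the lookup as a parameter instead of reading the process environment directly also makes the function easy to test without ambient FUZZ_* variables leaking in.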
@radakam radakam temporarily deployed to test-trigger-is June 24, 2026 08:29 — with GitHub Actions Inactive
radakam added 2 commits June 24, 2026 12:03
The terraform provider force-sends num_workers: 0 for a single-node
new_cluster (no autoscale) on both job_clusters and task-level clusters,
but JobClustersFixups only applied initializeNumWorkers to job_clusters.
The direct engine therefore omitted num_workers on task clusters, so the
two engines produced divergent create payloads. This divergence was
surfaced by the bundle/fuzz parity harness.

Apply initializeNumWorkers to task new_cluster too so the direct engine
matches terraform, and drop the now-obsolete tasks[*].new_cluster.num_workers
entry from the fuzz DefaultIgnorePaths.
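
A rough sketch of the fix described in the commit above, assuming a force-send list decides whether a zero `num_workers` is serialized. The types and the `forceNumWorkers` and `fixupClusters` names are hypothetical stand-ins for the real cluster spec and `initializeNumWorkers`; only the shape of the change (apply the same fixup to task clusters as to job clusters) comes from the commit message.

```go
package main

import "fmt"

// clusterSpec is a hypothetical, reduced stand-in for the real cluster type.
type clusterSpec struct {
	NumWorkers      int
	Autoscale       *struct{ Min, Max int }
	ForceSendFields []string // fields to serialize even when zero (assumed mechanism)
}

// jobSettings holds both places a new_cluster can appear.
type jobSettings struct {
	JobClusters  []*clusterSpec // shared job_clusters[*].new_cluster
	TaskClusters []*clusterSpec // tasks[*].new_cluster, nil when a task has none
}

// forceNumWorkers marks num_workers for sending on a single-node cluster:
// no autoscale and zero workers. It is idempotent.
func forceNumWorkers(c *clusterSpec) {
	if c == nil || c.Autoscale != nil || c.NumWorkers != 0 {
		return
	}
	for _, f := range c.ForceSendFields {
		if f == "NumWorkers" {
			return
		}
	}
	c.ForceSendFields = append(c.ForceSendFields, "NumWorkers")
}

// fixupClusters applies the same rule to shared and task-level clusters.
func fixupClusters(js *jobSettings) {
	for _, c := range js.JobClusters {
		forceNumWorkers(c)
	}
	for _, c := range js.TaskClusters {
		forceNumWorkers(c)
	}
}

func main() {
	js := &jobSettings{TaskClusters: []*clusterSpec{{}, {NumWorkers: 2}, nil}}
	fixupClusters(js)
	fmt.Println(js.TaskClusters[0].ForceSendFields, js.TaskClusters[1].ForceSendFields)
	// prints: [NumWorkers] []
}
```
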
The nightly test-fuzz job is intentionally excluded from test-result, so
a failure was only visible in the Actions tab. Add a failure step that
opens (or comments on) a single deduped GitHub issue with a one-command
repro.

Also correct the jobsCreatePath comment: a different API version shows up
as a capture failure (the testserver registers only this route, so a
mismatched version 404s and the deploy fails), not as a payload diff.
@radakam radakam temporarily deployed to test-trigger-is June 24, 2026 12:05 — with GitHub Actions Inactive
…ion test

Rename the capture/deploy/recorder helpers to *_test.go so the parity
harness compiles only under `go test` instead of into the package's
regular build, and add a committed regression test (cluster_fixups_test.go)
covering the single-node task-cluster num_workers force-send fix so the
divergence is guarded at PR time, not just in the nightly suite.
@radakam radakam temporarily deployed to test-trigger-is June 24, 2026 13:27 — with GitHub Actions Inactive
…ting

Move the remaining generator/diff/rand implementation into _test.go files
(keeping only a doc.go for the package comment) so nothing in the harness
compiles into the regular build, since no product code imports it.

Distinguish deploy/capture failures from create-payload divergences in
checkJobParity: skip when neither engine deploys the generated config, fail
distinctly when exactly one engine accepts it (an acceptance divergence, not
a payload diff), and only diff payloads when both deploys succeed. This keeps
nightly triage from misdirecting a deploy failure into regressionSeeds.
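
The three-way split described above can be sketched as a small classifier over the two deploy results. The `outcome` type and `classify` function are hypothetical names for illustration; the real `checkJobParity` may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// outcome says what the parity check should do with one seed.
type outcome string

const (
	outcomeSkip           outcome = "skip"            // neither engine deployed the config
	outcomeDeployMismatch outcome = "deploy-mismatch" // exactly one engine deployed it
	outcomeDiffPayloads   outcome = "diff-payloads"   // both deployed; compare payloads
)

// errDeploy is a placeholder error used in the example below.
var errDeploy = errors.New("deploy failed")

// classify maps the two deploy errors onto the three outcomes.
func classify(terraformErr, directErr error) outcome {
	switch {
	case terraformErr != nil && directErr != nil:
		return outcomeSkip
	case terraformErr != nil || directErr != nil:
		return outcomeDeployMismatch
	default:
		return outcomeDiffPayloads
	}
}

func main() {
	fmt.Println(classify(errDeploy, errDeploy)) // prints: skip
	fmt.Println(classify(nil, errDeploy))       // prints: deploy-mismatch
	fmt.Println(classify(nil, nil))             // prints: diff-payloads
}
```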

Also document the unique-identity-key assumption in diffKeyedSlice.
@radakam radakam temporarily deployed to test-trigger-is June 24, 2026 13:35 — with GitHub Actions Inactive
Use strings.SplitSeq instead of ranging over strings.Split (modernize
stringsseq) and require.Positivef instead of require.Greaterf(t, n, 0)
(testifylint negative-positive).
@radakam radakam temporarily deployed to test-trigger-is June 25, 2026 11:42 — with GitHub Actions Inactive
The failure-reporting step used `gh issue list --jq '.[0].number'`, which
prints the literal "null" when no open issue exists, so it always took the
comment branch and tried to comment on issue "null" instead of creating one.
Use `// empty` so the create branch runs on the first divergence.
@radakam radakam temporarily deployed to test-trigger-is June 25, 2026 11:54 — with GitHub Actions Inactive
Revert the num_workers single-node task-cluster fix along with its unit
test and acceptance updates so this PR adds only the parity harness.

Both terraform/direct divergences the harness found are now documented and
suppressed via DefaultIgnorePaths rather than fixed (fixes follow
separately): num_workers on single-node task clusters (seed 29) and the
spark.databricks.delta.preview.enabled spark conf key.
@radakam radakam temporarily deployed to test-trigger-is June 25, 2026 17:47 — with GitHub Actions Inactive
Address review feedback on the create-payload parity harness:

- Replace the path-only ignore list with value-conditional ignore rules so
  the documented num_workers divergence (direct omits, terraform force-sends
  0) is suppressed only for that exact shape; a real value mismatch at the
  same path now fails again.
- Unexport package-internal identifiers (generateJob, diffPayloads,
  difference, defaultIgnoreRules) that are only used within the package.
- Document why TestCaptureJobCreateDirect is intentionally not opt-in.
- Reword the one-sided-deploy failures as deploy/capture differences rather
  than asserting one engine "rejected" the config.
- Make TestParitySeeds hermetic against ambient FUZZ_* env vars.
- Correct the seed 29 comment to reflect that the divergence is suppressed.
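
To illustrate the first item, here is a minimal sketch of a value-conditional ignore rule: a difference is suppressed only when both the path and the exact shape match. The `difference` and `ignoreRule` fields, the `[*]` path normalization, and the use of `float64` for decoded JSON numbers are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// difference describes one mismatch between the two payloads (hypothetical shape).
type difference struct {
	Path         string
	Terraform    any  // value sent by terraform
	Direct       any  // value sent by direct, if any
	DirectHasKey bool // false when direct omitted the key entirely
}

// ignoreRule suppresses a difference only if the path and the shape both match.
type ignoreRule struct {
	Pattern string
	Matches func(d difference) bool
}

var indexRe = regexp.MustCompile(`\[\d+\]`)

// normalize turns concrete indices into wildcards: tasks[3] becomes tasks[*].
func normalize(path string) string {
	return indexRe.ReplaceAllString(path, "[*]")
}

// ignored reports whether any rule covers the difference.
func ignored(d difference, rules []ignoreRule) bool {
	for _, r := range rules {
		if normalize(d.Path) == r.Pattern && r.Matches(d) {
			return true
		}
	}
	return false
}

// numWorkersRule covers only "direct omits, terraform sends 0".
var numWorkersRule = ignoreRule{
	Pattern: "tasks[*].new_cluster.num_workers",
	Matches: func(d difference) bool {
		return !d.DirectHasKey && d.Terraform == float64(0)
	},
}

func main() {
	rules := []ignoreRule{numWorkersRule}
	known := difference{Path: "tasks[0].new_cluster.num_workers", Terraform: float64(0)}
	real := difference{Path: "tasks[0].new_cluster.num_workers",
		Terraform: float64(2), Direct: float64(4), DirectHasKey: true}
	fmt.Println(ignored(known, rules), ignored(real, rules)) // prints: true false
}
```

A path-only list would have suppressed both cases; the predicate is what lets a genuine value mismatch at the same path fail again.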
@radakam radakam temporarily deployed to test-trigger-is June 26, 2026 07:42 — with GitHub Actions Inactive
The terraform provider force-sends num_workers:0 for a single-node
new_cluster on task-level clusters too, not just shared job_clusters, but
prepareJobSettingsForUpdate only applied initializeNumWorkers to
job_clusters. The direct engine therefore omitted num_workers on task
clusters and the two engines produced divergent create payloads (found by
the bundle/fuzz parity harness, seed 29).

Apply initializeNumWorkers to task new_cluster too so the direct engine
matches terraform, drop the now-obsolete tasks[*].new_cluster.num_workers
ignore entry, and simplify the fuzz ignore list to a plain []string now
that value-conditional matching is no longer needed.
@radakam radakam temporarily deployed to test-trigger-is June 26, 2026 08:22 — with GitHub Actions Inactive
@radakam radakam marked this pull request as ready for review June 26, 2026 08:34
github-actions Bot commented Jun 26, 2026

Contributor

Approval status: pending

/bundle/ - needs approval

10 files changed
Suggested: @pietern
Also eligible: @denik, @janniklasrose, @shreyas-goenka, @anton-107, @andrewnester, @lennartkats-db

General files (require maintainer)

Files: .github/workflows/push.yml, Taskfile.yml
Based on git history:

  • @pietern -- recent work in .github/workflows/, ./

Any maintainer (@andrewnester, @anton-107, @denik, @pietern, @shreyas-goenka, @simonfaltum, @renaudhartert-db) can approve all areas.
See OWNERS for ownership rules.

Comment thread bundle/config/mutator/resourcemutator/cluster_fixups.go Outdated
// tends to differ between engines.
//
// TODO(DECO-25361): generalize the harness across resource kinds.
func generateJob(rng *rand.Rand) *resources.Job {

Contributor

Interesting approach.

I was thinking more of using the bundle schema so we can fuzz anything, not just jobs.

Switch the fuzz suite from comparing terraform and direct create payloads to
asserting invariants on the direct engine's payload. Terraform and direct can
disagree for legitimate reasons, so a payload diff is noisy; an invariant has no
legitimate reason to fail, so a failure is a real bug. This drops the payload
diff and its ignore-list of documented divergences, and removes terraform from
the harness (each seed is now one in-process direct deploy).

Gate on `bundle validate` so the suite distinguishes the two fuzzing outcomes:
an invalid config skips (it can't violate an invariant), while a validated config
that fails to deploy or breaks an invariant fails. This is the distinction a
looser, schema-driven generator will rely on.
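
The validate gate and invariant check can be sketched as below. Everything named here is hypothetical: `checkSeed`, the `invariant` type, and the example `hasName` invariant are illustrations, since the commit message does not list the real invariants.

```go
package main

import (
	"errors"
	"fmt"
)

// payload is the captured create request body, decoded from JSON.
type payload map[string]any

// invariant is a property every valid payload is expected to satisfy.
type invariant struct {
	Name  string
	Check func(payload) error
}

type result string

const (
	resultSkip result = "skip"
	resultPass result = "pass"
	resultFail result = "fail"
)

// errInvalid is a placeholder validation error used in the example below.
var errInvalid = errors.New("config is invalid")

// checkSeed skips invalid configs and fails validated ones that either
// do not deploy or break an invariant.
func checkSeed(validate func() error, deploy func() (payload, error), invs []invariant) (result, error) {
	if err := validate(); err != nil {
		return resultSkip, nil // an invalid config cannot violate an invariant
	}
	p, err := deploy()
	if err != nil {
		return resultFail, fmt.Errorf("validated config failed to deploy: %w", err)
	}
	for _, inv := range invs {
		if err := inv.Check(p); err != nil {
			return resultFail, fmt.Errorf("invariant %q violated: %w", inv.Name, err)
		}
	}
	return resultPass, nil
}

// hasName is an illustrative invariant: the payload carries a non-empty name.
var hasName = invariant{Name: "name-is-set", Check: func(p payload) error {
	if s, ok := p["name"].(string); !ok || s == "" {
		return errors.New("missing name")
	}
	return nil
}}

func main() {
	ok := func() error { return nil }
	bad := func() error { return errInvalid }
	deploy := func() (payload, error) { return payload{"name": "job"}, nil }

	r1, _ := checkSeed(bad, deploy, []invariant{hasName})
	r2, _ := checkSeed(ok, deploy, []invariant{hasName})
	fmt.Println(r1, r2) // prints: skip pass
}
```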

Revert the num_workers:0 force-send for single-node task clusters (and its
acceptance goldens): it only matched terraform's payload, with no demonstrated
behavior benefit, and direct has shipped without it. If a real backend
requirement is confirmed, it can return as a standalone change.
@radakam radakam temporarily deployed to test-trigger-is June 26, 2026 15:45 — with GitHub Actions Inactive
3 participants