feat(iac): add regional orchestrator cache#2905
Conversation
PR SummaryHigh Risk Overview Regional placement & NFS: Cross-region join: Startup scripts pass primary-region Nomad region, Consul datacenter, and a broader GCE retry-join zone pattern so remote clients discover the control plane; Docker auth can include the primary Artifact Registry region when it differs from the VM region. Agent/runtime tweaks: Nomad Reviewed by Cursor Bugbot for commit d9d99b3. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 3 Tests Failed:
View the full list of 3 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
When using REGIONAL or ENTERPRISE tiers for GCP Filestore, the location must be a region rather than a zone. Passing a zone like filestore_zone in regional_filestore will cause a deployment failure for these tiers, so the location should be conditionally set to the region if a regional tier is selected.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
| name = "${var.prefix}shared-disk-store-${each.key}" | ||
| network_name = var.network_name | ||
| location = each.value.filestore_zone |
There was a problem hiding this comment.
When using REGIONAL or ENTERPRISE tiers for GCP Filestore, the location must be a region rather than a zone. Passing a zone like filestore_zone will cause a deployment failure for these tiers. Conditionally set the location to the region (each.key) if a regional tier is selected, and default to the zone (each.value.filestore_zone) otherwise.
location = contains(["REGIONAL", "ENTERPRISE"], var.filestore_cache_tier) ? each.key : each.value.filestore_zone
This reverts commit 34dcd14.
Allow secondary client clusters to pin regional MIG placement to explicit zones without changing build or primary client pools. Co-authored-by: Cursor <cursoragent@cursor.com>
| variable "nomad_version" { | ||
| type = string | ||
| default = "1.8.4" | ||
| default = "1.6.2" |
There was a problem hiding this comment.
Packer defaults downgrade Consul Nomad
Medium Severity
Default consul_version and nomad_version in both GCP and AWS disk-image Packer configs were lowered (1.17.3→1.16.2, 1.8.4→1.6.2). make build only overrides those when PACKER_* env vars are set, so new images can ship older agents than the rest of the cluster.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c9a0b3a. Configure here.
| google_service_account_key = var.google_service_account_key | ||
| gcp_region = local.client_cluster_locations[each.key].region | ||
| gcp_zone = local.client_cluster_locations[each.key].zone | ||
| zones = local.client_cluster_locations[each.key].region != var.gcp_region ? each.value.zones : null |
There was a problem hiding this comment.
Secondary zone ignores MIG placement
Medium Severity
For client clusters in a secondary region, zone still drives filestore_zone (via client_cluster_locations), but distribution_policy_zones is only set from zones when the cluster region differs from the primary. If zones is omitted, the regional MIG can place VMs in any zone in that region while Filestore stays in the configured zone, hurting NFS locality.
Reviewed by Cursor Bugbot for commit 44280bb. Configure here.
| attempts = 0 | ||
| } | ||
|
|
||
| resources { |
There was a problem hiding this comment.
Orchestrator job drops memory limits
Medium Severity
The orchestrator start task no longer declares a resources block (previously memory and memory_max). With client raw_exec using no_cgroups in the same change, Nomad will not reserve or cap memory for this system job, increasing risk of host OOM under load.
Reviewed by Cursor Bugbot for commit 44280bb. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 5 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit d9d99b3. Configure here.
| DOCKER_REGISTRY_REGION = var.docker_registry_region | ||
| NOMAD_REGION = var.nomad_region | ||
| CONSUL_DATACENTER = var.consul_datacenter | ||
| CONSUL_RETRY_JOIN_ZONE_PATTERN = var.consul_retry_join_zone_pattern |
There was a problem hiding this comment.
Empty NFS still mounts
High Severity
When filestore_cache_enabled is true, secondary-region client clusters without a matching additional_filestores entry get empty nfs_ip_addresses, but USE_FILESTORE_CACHE still follows the global flag. Startup runs an NFS mount with a blank host under set -e, so those nodes can fail boot instead of skipping cache as documented.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit d9d99b3. Configure here.
| region = coalesce(config.region, var.gcp_region) | ||
| zone = coalesce(config.zone, var.gcp_zone) | ||
| filestore_zone = coalesce(config.filestore_zone, config.zone, var.gcp_zone) | ||
| } |
There was a problem hiding this comment.
filestore_zone config unused
Medium Severity
client_clusters_config now accepts filestore_zone, and client_cluster_locations computes it, but nothing reads that local or passes it to Filestore or worker modules. Per-cluster filestore zone overrides have no effect; only additional_filestores[].location controls regional cache placement.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit d9d99b3. Configure here.


Summary
Validation
Note: a later terraform validate rerun hit local provider cache checksum mismatch after removing init-only lockfile hash noise; lockfile is unchanged in the PR.