feat: resolve scout firmware upgrade scripts from static assets #2756

Open
rahmonov wants to merge 1 commit into NVIDIA:main from rahmonov:local-scout-scripts

Conversation

@rahmonov

@rahmonov rahmonov commented Jun 22, 2026

Contributor

This PR adds static Scout firmware upgrade scripts.

Instead of requiring firmware metadata to provide an arbitrary Scout script URL, NICo now resolves Scout upgrade scripts from checked-in static assets using the machine vendor, model, and firmware component type. The machine-controller reads the matching script and metadata, computes the script SHA-256, applies the configured per-script timeouts, and builds the Scout firmware upgrade task with a PXE-served script URL.
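
As a rough illustration of that lookup (a sketch, not the code in this PR): `ScriptLocation` and `locate_scout_script` are hypothetical names, while the `vendor/model/component` layout, the `upgrade.sh` and `metadata.toml` file names, and the `/public/scout-firmware-scripts` route are taken from the file list and review summary for this PR. Hashing and metadata parsing are left out because they depend on the `sha2` and `toml` crates.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical result type: where the script lives locally and where Scout would fetch it.
#[derive(Debug, PartialEq)]
struct ScriptLocation {
    script_path: PathBuf,
    metadata_path: PathBuf,
    url: String,
}

/// Sketch: map (vendor, model, component) onto the static asset tree.
/// Assumes the three keys were already validated as safe, single path segments.
fn locate_scout_script(
    scripts_root: &Path,
    pxe_public_base_url: &str,
    vendor: &str,
    model: &str,
    component: &str,
) -> ScriptLocation {
    let dir = scripts_root.join(vendor).join(model).join(component);
    // Avoid a double slash when the configured base URL ends with '/'.
    let base = pxe_public_base_url.trim_end_matches('/');
    ScriptLocation {
        script_path: dir.join("upgrade.sh"),
        metadata_path: dir.join("metadata.toml"),
        url: format!(
            "{base}/public/scout-firmware-scripts/{vendor}/{model}/{component}/upgrade.sh"
        ),
    }
}
```

Called with the root `/opt/carbide/scout-firmware-scripts` and the keys `nvidia`, `dgxh100`, `cx7`, this yields the local path of the checked-in CX7 script and a URL under the PXE base; the real resolver additionally reads the files and computes the hash.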

This will be especially needed when NICo moves to an API-based approach for firmware metadata.

Suggested order of files to review:

  1. Start with the handler: crates/machine-controller/src/handler.rs. It gives a high-level view of the changes.
  2. Then crates/machine-controller/src/scout_firmware_scripts.rs. Helper functions to parse and find a scout script.
  3. The rest.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

The API and PXE images both copy the Scout script assets into:

/opt/carbide/scout-firmware-scripts

PXE serves that directory from a fixed image path. The API/machine-controller reads from the configured local script directory so it can compute the script hash and read script metadata before constructing the Scout task.

skaffold.yml syncs pxe/scout-firmware-scripts/**/* for local development.

@copy-pr-bot

copy-pr-bot Bot commented Jun 22, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 71e69b62-9201-4313-a59b-1fb7d2f0c2f2

📥 Commits

Reviewing files that changed from the base of the PR and between ada9be4 and cc7a06e.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • crates/api-core/src/cfg/README.md
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/cfg/test_data/full_config.toml
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-model/src/firmware.rs
  • crates/firmware/src/tests/config.rs
  • crates/machine-controller/Cargo.toml
  • crates/machine-controller/src/config/mod.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/lib.rs
  • crates/machine-controller/src/scout_firmware_scripts.rs
  • crates/pxe/Cargo.toml
  • crates/pxe/src/main.rs
  • crates/utils/src/lib.rs
  • dev/deployment/localdev/Dockerfile.api.localdev
  • dev/deployment/localdev/Dockerfile.api.localdev.minikube
  • dev/deployment/localdev/Dockerfile.pxe.localdev
  • dev/deployment/localdev/Dockerfile.pxe.localdev.minikube
  • dev/docker/Dockerfile.release-container-aarch64
  • dev/docker/Dockerfile.release-container-sa-x86_64
  • dev/docker/Dockerfile.release-container-x86_64
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.toml
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh
  • skaffold.yml
✅ Files skipped from review due to trivial changes (9)
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.toml
  • skaffold.yml
  • crates/api-core/src/cfg/test_data/full_config.toml
  • dev/docker/Dockerfile.release-container-sa-x86_64
  • crates/utils/src/lib.rs
  • crates/machine-controller/src/lib.rs
  • crates/api-core/src/cfg/README.md
  • dev/deployment/localdev/Dockerfile.pxe.localdev
  • crates/pxe/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (15)
  • dev/docker/Dockerfile.release-container-aarch64
  • crates/pxe/src/main.rs
  • crates/machine-controller/src/config/mod.rs
  • dev/docker/Dockerfile.release-container-x86_64
  • dev/deployment/localdev/Dockerfile.api.localdev
  • crates/machine-controller/Cargo.toml
  • dev/deployment/localdev/Dockerfile.api.localdev.minikube
  • crates/api-core/src/test_support/default_config.rs
  • crates/firmware/src/tests/config.rs
  • crates/api-core/src/cfg/file.rs
  • crates/api-model/src/firmware.rs
  • dev/deployment/localdev/Dockerfile.pxe.localdev.minikube
  • crates/machine-controller/src/handler.rs
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh
  • crates/machine-controller/src/scout_firmware_scripts.rs

Summary by CodeRabbit

  • New Features

    • Added local resolution and HTTP serving of Scout firmware upgrade scripts.
    • Introduced pxe_public_base_url to control the public base URL used for PXE firmware artifact links.
    • Made Scout firmware script optional (it can be omitted).
  • Bug Fixes

    • Improved firmware upgrade handling with safer PXE artifact URL generation and clearer errors when script resolution fails.
  • Chores

    • Updated local development and release container images, plus sync patterns, to include the Scout firmware scripts and added/updated script metadata timeouts.

Walkthrough

Adds Scout firmware script resolution, PXE serving, config propagation, optional script metadata, a CX7 upgrade script with metadata, and deployment image updates for the new script directory.

Changes

Scout Firmware Script Resolution & PXE Serving

| Layer | File(s) | Summary |
| --- | --- | --- |
| pxe_public_base_url configuration field and propagation | `crates/api-core/src/cfg/README.md`, `crates/api-core/src/cfg/file.rs`, `crates/api-core/src/cfg/test_data/full_config.toml`, `crates/api-core/src/test_support/default_config.rs`, `crates/machine-controller/src/config/mod.rs` | Adds pxe_public_base_url: String to CarbideConfig with a serde default helper, propagates it into MachineStateHandlerSiteConfig, updates test fixtures and the default config helper, and documents the field in the API config reference. |
| ScoutConfig.script made optional | `crates/api-model/src/firmware.rs`, `crates/firmware/src/tests/config.rs` | ScoutConfig.script changes from a required FirmwareFileArtifact to `Option<FirmwareFileArtifact>` with serde default/skip, ScoutConfig gains Default derive, and the CX7 config test removes the inline script block and asserts script.is_none(). |
| New scout_firmware_scripts resolver module | `crates/machine-controller/Cargo.toml`, `crates/machine-controller/src/lib.rs`, `crates/machine-controller/src/scout_firmware_scripts.rs` | Declares the crate-visible module, defines ScoutFirmwareScript and ScoutFirmwareScriptMetadata, implements find_scout_script with filesystem resolution, path-safety validation, component directory mapping, SHA256 computation, TOML metadata parsing, and PXE URL construction. Adds hex, sha2, toml, and tempfile dependencies. Includes full unit test coverage. |
| Handler: dynamic script resolution and artifact URL construction | `crates/machine-controller/src/handler.rs` | Adds firmware_artifact_url helper for PXE URL derivation. Replaces static to_install.scout logic in host_checking_fw_noclear with find_scout_script, builds file_artifacts via firmware_artifact_url, and constructs ScoutFirmwareUpgradeTask from resolved scout_script fields. Adds unit tests for firmware_artifact_url. |
| PXE server: SCOUT_FIRMWARE_SCRIPTS_DIR constant and static route | `crates/utils/src/lib.rs`, `crates/pxe/Cargo.toml`, `crates/pxe/src/main.rs` | Adds the SCOUT_FIRMWARE_SCRIPTS_DIR constant in carbide-utils, adds the crate as a PXE dependency, and mounts a nest_service at /public/scout-firmware-scripts backed by ServeDir with 10 MiB buffering. |
| NVIDIA CX7 upgrade.sh and metadata.toml | `pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/*` | Adds the CX7 Scout firmware upgrade script with mlxfwmanager XML querying, awk-based ConnectX-7 device filtering, and unattended update execution. Adds metadata with execution_timeout_seconds=7200 and artifact_download_timeout_seconds=600. |
| Docker and Skaffold packaging | `dev/deployment/localdev/Dockerfile.*`, `dev/docker/Dockerfile.release-container-*`, `skaffold.yml` | All localdev and release Dockerfiles gain RUN mkdir -p and COPY steps for /opt/carbide/scout-firmware-scripts. Skaffold sync patterns include `pxe/scout-firmware-scripts/**/*`. |

Sequence Diagram(s)

sequenceDiagram
    participant Handler as host_checking_fw_noclear
    participant Resolver as find_scout_script
    participant Filesystem as script tree
    participant PXEServer as /public/scout-firmware-scripts
    Handler->>Resolver: find_scout_script(pxe_public_base_url, vendor, model, component_type)
    Resolver->>Filesystem: read upgrade.sh and metadata.toml
    Filesystem-->>Resolver: script bytes and timeout metadata
    Resolver->>Resolver: compute SHA256 and build PXE URL
    Resolver-->>Handler: ScoutFirmwareScript
    Handler->>PXEServer: use script URL and file artifact URLs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: resolving Scout firmware upgrade scripts from static assets. |
| Description check | ✅ Passed | The description accurately matches the changeset and explains the static-asset Scout script resolution flow. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🧹 Nitpick comments (4)
crates/pxe/src/main.rs (2)

38-39: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Avoid hard-coding the Scout scripts root in PXE.

This duplicates the scripts-directory contract in code, so non-default scout_firmware_scripts_directory deployments can drift and break downloads. Prefer sourcing this path from runtime configuration/env for the PXE service too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/pxe/src/main.rs` around lines 38 - 39, The hard-coded constant
SCOUT_FIRMWARE_SCRIPTS_DIR duplicates the scripts directory configuration and
prevents dynamic deployment configurations from being properly honored. Replace
this constant with a configuration value or environment variable that is read at
runtime, similar to how the scout_firmware_scripts_directory setting is handled
in other parts of the codebase. This ensures that the PXE service can respect
non-default scout_firmware_scripts_directory deployments without requiring code
changes or recompilation.
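
One way to read this suggestion is sketched below. It is an illustration under assumptions, not the PXE crate's code: `scout_scripts_dir` and `DEFAULT_SCOUT_FIRMWARE_SCRIPTS_DIR` are hypothetical names, and the default value is the image path named in the PR description.

```rust
use std::path::PathBuf;

/// Image path the PR description says the script assets are copied into.
const DEFAULT_SCOUT_FIRMWARE_SCRIPTS_DIR: &str = "/opt/carbide/scout-firmware-scripts";

/// Sketch: pick the scripts root from an optional override, falling back to the default.
/// The override would come from configuration or an environment variable;
/// an empty or whitespace-only value is treated as unset.
fn scout_scripts_dir(override_value: Option<&str>) -> PathBuf {
    match override_value.map(str::trim) {
        Some(value) if !value.is_empty() => PathBuf::from(value),
        _ => PathBuf::from(DEFAULT_SCOUT_FIRMWARE_SCRIPTS_DIR),
    }
}
```

At startup this could be fed from something like `std::env::var("SCOUT_FIRMWARE_SCRIPTS_DIR").ok().as_deref()`, where the variable name is an assumption chosen for the example.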

87-91: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Fail fast when the Scout scripts directory is missing.

The route is mounted even if the directory is absent, which defers failure to runtime upgrade requests (404s). Add a startup existence/readability check for the mounted path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/pxe/src/main.rs` around lines 87 - 91, The ServeDir mounting for
SCOUT_FIRMWARE_SCRIPTS_DIR does not validate that the directory exists or is
readable before the route is registered, causing failures to occur later during
runtime requests instead of at startup. Add a validation check before the
.nest_service() call that verifies SCOUT_FIRMWARE_SCRIPTS_DIR exists and is
readable, then panic or return an error to fail fast during application startup
if the directory is missing or inaccessible.
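
A startup check along these lines could look roughly like the following. This is a sketch using only the standard library; `ensure_readable_dir` is a hypothetical helper name, and how the error is surfaced (panic or returned error) is left to the caller.

```rust
use std::{fs, io, path::Path};

/// Sketch: verify that `dir` exists, is a directory, and can be opened for listing.
fn ensure_readable_dir(dir: &Path) -> io::Result<()> {
    // Fails if the path is missing or its metadata cannot be read.
    let metadata = fs::metadata(dir)?;
    if !metadata.is_dir() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is not a directory", dir.display()),
        ));
    }
    // Fails if the directory exists but cannot be opened, for example due to permissions.
    fs::read_dir(dir).map(|_| ())
}
```

Calling this before the route is mounted turns a missing directory into a startup error instead of 404 responses during an upgrade.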
crates/firmware/src/tests/config.rs (1)

223-267: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep a legacy-script parsing regression case.

Line 267 now only validates the script = None path. Since this feature is explicitly backward-compatible, add a companion case that includes [components.cx7.known_firmware.scout.script] and asserts it still deserializes as Some(...).

As per coding guidelines, “Prefer table-driven tests … Reach for a table whenever two or more tests call the same operation.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/firmware/src/tests/config.rs` around lines 223 - 267, The test
cx7_component_config_parses_as_first_class_component currently only validates
the case where scout.script is None, which doesn't test backward compatibility
for scout script deserialization. Refactor this into a table-driven test with
two cases: one without the [components.cx7.known_firmware.scout.script] section
that asserts scout.script.is_none() (existing behavior), and a second case that
includes the scout script configuration block and asserts that scout.script
deserializes as Some(...) to ensure the legacy script parsing works correctly.

Source: Coding guidelines

crates/machine-controller/src/scout_firmware_scripts.rs (1)

186-205: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Add table coverage for rejected path segments.

safe_lookup_key is the boundary preventing inventory strings from selecting nested or traversal paths, but the tests do not exercise values like ../dgxh100, dgx/h100, or space-containing model names. Add a small table so this invariant cannot regress. As per coding guidelines, prefer table-driven tests for functions that map inputs to outputs.

Proposed test coverage
     fn is_safe_path_segment(value: &str) -> bool {
         !value.is_empty()
             && value
                 .chars()
                 .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '-' | '_'))
     }
+
+    #[test]
+    fn safe_lookup_key_normalizes_only_safe_single_segments() {
+        for (input, expected) in [
+            ("DGXH100", Some("dgxh100")),
+            ("dgx-h100", Some("dgx-h100")),
+            ("dgx_h100", Some("dgx_h100")),
+            ("../dgxh100", None),
+            ("dgx/h100", None),
+            ("dgx h100", None),
+            ("", None),
+        ] {
+            assert_eq!(safe_lookup_key(input), expected.map(String::from), "{input:?}");
+        }
+    }
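
For readers without the module open, the behavior this table expects can be sketched as follows. The body of `safe_lookup_key` is an assumption inferred from the expected values in the table (lowercase, then validate); only `is_safe_path_segment` is copied from the diff above.

```rust
/// Same predicate as in the diff above: lowercase ASCII letters, digits, '-' and '_' only.
fn is_safe_path_segment(value: &str) -> bool {
    !value.is_empty()
        && value
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '-' | '_'))
}

/// Assumed behavior: lowercase the inventory string, then accept it only
/// if the result is a single safe path segment.
fn safe_lookup_key(value: &str) -> Option<String> {
    let key = value.to_ascii_lowercase();
    is_safe_path_segment(&key).then_some(key)
}
```

Because `.` and `/` are not in the allowed character set, traversal inputs such as `../dgxh100` and nested inputs such as `dgx/h100` are rejected outright rather than sanitized.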
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/scout_firmware_scripts.rs` around lines 186 -
205, The safe_lookup_key and is_safe_path_segment functions lack comprehensive
test coverage for unsafe path segments, creating risk of regression on security
boundaries that prevent traversal attacks. Add a table-driven test for
safe_lookup_key that exercises both accepted cases (like "dgxh100", "dgx-h100")
and rejected cases (like "../dgxh100", "dgx/h100", "dgx h100" with spaces) to
ensure the invariant preventing nested or traversal paths cannot regress.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/config/firmware_global.rs`:
- Around line 65-68: The scout_firmware_scripts_directory configuration field
allows runtime customization of where firmware scripts are located, but PXE
serves the scripts from a hardcoded SCOUT_FIRMWARE_SCRIPTS_DIR constant. This
creates a divergence where the controller's script resolver and PXE's static
serving can point to different directories. To fix this, ensure that PXE's
static serving route is configured to use the same
scout_firmware_scripts_directory value from the FirmwareGlobal configuration at
runtime, rather than relying on the hardcoded SCOUT_FIRMWARE_SCRIPTS_DIR
constant, so both the resolver and PXE serving always use the same source path.

In `@crates/machine-controller/src/config/mod.rs`:
- Line 37: The new required field pxe_public_base_url has been added to the
struct on line 37, but the struct literal in the test_default function (around
lines 56-67) is missing this field, causing the test build to fail. Locate the
struct initialization in test_default and add the pxe_public_base_url field with
an appropriate test value (such as a sample URL string) to match the String type
of the field.

In `@crates/machine-controller/src/handler.rs`:
- Around line 8217-8233: The to_pxe_url closure uses string-based strip_prefix
instead of path-aware prefix stripping, which causes incorrect behavior when
firmware_directory matches directory name prefixes (e.g., /opt/nico/firmware and
/opt/nico/firmware2). Replace the string-based strip_prefix call with
Path::strip_prefix to enforce semantic correctness for path manipulation. Change
the to_pxe_url closure signature to return Result<String, StateHandlerError>
instead of String, and return an error when the file path falls outside the
firmware_directory boundary instead of silently accepting it. Update the map
operation that calls to_pxe_url (around line 8252) to properly propagate the
error through the Result chain.

In `@crates/machine-controller/src/scout_firmware_scripts.rs`:
- Around line 79-96: Replace the `.exists()` calls with `.try_exists()` to
properly propagate filesystem errors instead of silently converting them to
false. Specifically, in the root.exists() check on line 79 and in the
script_path.exists() and metadata_path.exists() checks on line 94 within the
get_scout_firmware_script_root function, change these calls to use try_exists()
and apply the ? operator to propagate any Result errors. Wrap each try_exists()
call with wrap_err_with() to add contextual information about what filesystem
access was being attempted, enabling operators to distinguish between missing
scripts and inaccessible configured scripts.
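
The difference between `exists()` and `try_exists()` raised in the last comment can be sketched with the standard library alone. `ScriptLookup` and `lookup_script` are hypothetical names for this illustration, and the error-context wrapping mentioned in the comment is omitted.

```rust
use std::{io, path::Path};

#[derive(Debug, PartialEq)]
enum ScriptLookup {
    Found,
    Missing,
}

/// Sketch: report Missing only when a path is confirmed absent.
fn lookup_script(script_path: &Path, metadata_path: &Path) -> io::Result<ScriptLookup> {
    // `try_exists` returns Ok(false) when the path is confirmed absent;
    // other failures, such as permission denied, surface as Err
    // instead of being folded into a silent `false`.
    if script_path.try_exists()? && metadata_path.try_exists()? {
        Ok(ScriptLookup::Found)
    } else {
        Ok(ScriptLookup::Missing)
    }
}
```

With this shape, the caller can treat `Missing` as "no static script for this machine" and treat `Err` as an operational problem with the configured directory.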


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 239bb055-80f8-4108-a94e-84a538cac14f

📥 Commits

Reviewing files that changed from the base of the PR and between 4acacdb and cdc8d03.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • crates/api-core/src/cfg/README.md
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/cfg/test_data/full_config.toml
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-model/src/firmware.rs
  • crates/firmware/src/tests/config.rs
  • crates/machine-controller/Cargo.toml
  • crates/machine-controller/src/config/firmware_global.rs
  • crates/machine-controller/src/config/mod.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/lib.rs
  • crates/machine-controller/src/scout_firmware_scripts.rs
  • crates/pxe/src/main.rs
  • dev/deployment/localdev/Dockerfile.api.localdev
  • dev/deployment/localdev/Dockerfile.api.localdev.minikube
  • dev/deployment/localdev/Dockerfile.pxe.localdev
  • dev/deployment/localdev/Dockerfile.pxe.localdev.minikube
  • dev/docker/Dockerfile.release-container-aarch64
  • dev/docker/Dockerfile.release-container-sa-x86_64
  • dev/docker/Dockerfile.release-container-x86_64
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.toml
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh
  • skaffold.yml

Comment thread crates/machine-controller/src/config/firmware_global.rs Outdated
Comment thread crates/machine-controller/src/config/mod.rs
Comment thread crates/machine-controller/src/handler.rs Outdated
Comment thread crates/machine-controller/src/scout_firmware_scripts.rs
@rahmonov rahmonov force-pushed the local-scout-scripts branch 2 times, most recently from de5b23c to ada9be4 on June 23, 2026 13:56

@rahmonov rahmonov requested a review from krish-nvidia June 23, 2026 13:58
@rahmonov rahmonov marked this pull request as ready for review June 23, 2026 13:58
@rahmonov rahmonov requested a review from a team as a code owner June 23, 2026 13:58
@github-actions

github-actions Bot commented Jun 23, 2026


🔍 Container Scan Summary

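| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 265 | 6 | 24 | 98 | 7 | 130 |
| machine-validation-runner | 717 | 32 | 188 | 267 | 36 | 194 |
| machine_validation | 717 | 32 | 188 | 267 | 36 | 194 |
| nvmetal-carbide | 717 | 32 | 188 | 267 | 36 | 194 |
| TOTAL | 2422 | 102 | 588 | 905 | 115 | 712 |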

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@ianderson-nvidia

Contributor

@coderabbitai PTAL

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor

@ianderson-nvidia: I will conduct a thorough review of PR #2756. I'll examine the static-asset Scout firmware script resolution, the path sanitization logic in scout_firmware_scripts.rs, the handler changes, the API model update making ScoutConfig.script optional, and the configuration and deployment changes.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)

8235-8270: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Gate Scout task creation on local artifact availability.

This branch now serves firmware artifacts from PXE, but it bypasses the existing self.downloader.available(...) deferral used by the Redfish path. If the script resolves before every to_install.files artifact has been downloaded into firmware_directory, Scout receives PXE URLs that can 404 and fail the upgrade instead of waiting for the next reconciliation pass.

Proposed fix
                 if let Some(scout_script) = scout_script {
                     let firmware_dir = &ctx.services.site_config.firmware_global.firmware_directory;
                     let pxe_public_base_url = ctx
                         .services
                         .site_config
                         .pxe_public_base_url
                         .trim_end_matches('/');
+
+                    if let Some(artifact) = to_install.files.iter().find(|artifact| {
+                        !self.downloader.available(
+                            Path::new(&artifact.filename),
+                            &artifact.url,
+                            &artifact.sha256,
+                        )
+                    }) {
+                        tracing::debug!(
+                            filename = %artifact.filename,
+                            url = %artifact.url,
+                            "Scout firmware artifact is still downloading"
+                        );
+                        return Ok(StateHandlerOutcome::do_nothing());
+                    }
 
                     let upgrade_task_id = uuid::Uuid::new_v4().to_string();
                     let file_artifact_count = to_install.files.len();

As per path instructions, controller logic should be reviewed for reconciliation correctness, state-machine transitions, and safe recovery from partial failures.
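
The gating idea in the proposed fix can be isolated from the handler as a small sketch. `Artifact` and `first_pending` are hypothetical stand-ins; the availability check is passed in as a closure so the example does not assume the real signature of `self.downloader.available(...)`.

```rust
/// Hypothetical stand-in for a firmware artifact entry.
struct Artifact {
    filename: String,
}

/// Sketch: return the first artifact that is not yet available locally, if any.
/// A `Some` result means the caller should defer and retry on the next reconciliation pass.
fn first_pending<'a>(
    files: &'a [Artifact],
    is_available: impl Fn(&Artifact) -> bool,
) -> Option<&'a Artifact> {
    files.iter().find(|artifact| !is_available(*artifact))
}
```

The Scout task would only be built when `first_pending` returns `None`, so every URL handed to Scout points at a file that has already been downloaded.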

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 8235 - 8270, The Scout
task creation block (starting with if let Some(scout_script)) does not check
whether all firmware artifacts are available locally before proceeding, unlike
the Redfish path which uses self.downloader.available(...) for deferral. Add a
check to verify that all artifacts in to_install.files are available in the
firmware directory by calling self.downloader.available(...) for each file
before creating the ScoutFirmwareUpgradeTask. If any artifact is not yet
available, return an error or defer task creation to the next reconciliation
pass rather than allowing Scout to receive PXE URLs that may 404.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 184-204: The firmware_artifact_url function performs a lexical
check with Path::strip_prefix, which allows paths containing `..` components to
pass through without normalization, potentially enabling directory traversal
attacks. After obtaining the relative path from strip_prefix, validate that it
does not contain parent directory (..) components by iterating through path
components or checking the string representation and rejecting any paths that
include `..` as a component. This check should occur after the strip_prefix call
but before converting to a string and generating the final URL.

In `@pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh`:
- Around line 90-93: The printf statement on line 90 logs a mlxfwmanager command
without the `-f` flag, but the actual mlxfwmanager execution on line 93 includes
the `-f` flag. Update the printf statement to include the `-f` flag in the
logged command so that the output message accurately reflects the flags being
passed to mlxfwmanager during execution. This ensures the logged output matches
the actual command for debugging clarity.
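
The path handling discussed for firmware_artifact_url can be illustrated with a standalone sketch. `relative_url_path` is a hypothetical helper, not the function in the PR, and the PXE base URL is left out. It shows that `Path::strip_prefix` matches whole components but is purely lexical, so `..` has to be rejected separately.

```rust
use std::path::{Component, Path};

/// Sketch: derive a URL path for `file` relative to `firmware_dir`.
/// Returns None when `file` is outside `firmware_dir`, equals it,
/// or contains anything other than normal path segments.
fn relative_url_path(firmware_dir: &Path, file: &Path) -> Option<String> {
    // Component-aware: "/opt/nico/firmware2/x" does not match the prefix "/opt/nico/firmware".
    let relative = file.strip_prefix(firmware_dir).ok()?;
    let mut segments = Vec::new();
    for component in relative.components() {
        match component {
            Component::Normal(segment) => segments.push(segment.to_str()?),
            // `..` and any other non-normal component is refused,
            // because strip_prefix does not resolve or normalize the path.
            _ => return None,
        }
    }
    if segments.is_empty() {
        return None;
    }
    Some(segments.join("/"))
}
```

Returning `None` (or an error in the real handler) for out-of-tree paths keeps a misconfigured artifact from producing a URL that points outside the served directory.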


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb16bbfd-6d79-43fd-a1ab-676c3769b939

📥 Commits

Reviewing files that changed from the base of the PR and between 784be37 and ada9be4.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • crates/api-core/src/cfg/README.md
  • crates/api-core/src/cfg/file.rs
  • crates/api-core/src/cfg/test_data/full_config.toml
  • crates/api-core/src/test_support/default_config.rs
  • crates/api-model/src/firmware.rs
  • crates/firmware/src/tests/config.rs
  • crates/machine-controller/Cargo.toml
  • crates/machine-controller/src/config/mod.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/lib.rs
  • crates/machine-controller/src/scout_firmware_scripts.rs
  • crates/pxe/Cargo.toml
  • crates/pxe/src/main.rs
  • crates/utils/src/lib.rs
  • dev/deployment/localdev/Dockerfile.api.localdev
  • dev/deployment/localdev/Dockerfile.api.localdev.minikube
  • dev/deployment/localdev/Dockerfile.pxe.localdev
  • dev/deployment/localdev/Dockerfile.pxe.localdev.minikube
  • dev/docker/Dockerfile.release-container-aarch64
  • dev/docker/Dockerfile.release-container-sa-x86_64
  • dev/docker/Dockerfile.release-container-x86_64
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.toml
  • pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh
  • skaffold.yml

Comment thread crates/machine-controller/src/handler.rs
Comment thread pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh Outdated

@ianderson-nvidia ianderson-nvidia left a comment

Contributor

I can't tell from the changes, but is this breaking the previous upgrade path (the metadata containing a URL to a script)?

Comment thread crates/machine-controller/src/handler.rs
    if let Some(to_install) =
        need_host_fw_upgrade(&explored_endpoint, &fw_info, firmware_type)
    {
        if let Some(scout_config) = &to_install.scout {

Contributor Author

@ianderson-nvidia this is the relevant bit for your question. It will no longer use the "old" config; instead, it will try to find a script among the static files.

@rahmonov rahmonov force-pushed the local-scout-scripts branch from ada9be4 to cc7a06e on June 24, 2026 09:00