feat: resolve scout firmware upgrade scripts from static assets#2756
feat: resolve scout firmware upgrade scripts from static assets#2756rahmonov wants to merge 1 commit into
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (24)
✅ Files skipped from review due to trivial changes (9)
🚧 Files skipped from review as they are similar to previous changes (15)
Summary by CodeRabbit
WalkthroughAdds Scout firmware script resolution, PXE serving, config propagation, optional script metadata, a CX7 upgrade script with metadata, and deployment image updates for the new script directory. ChangesScout Firmware Script Resolution & PXE Serving
Sequence Diagram(s)sequenceDiagram
participant Handler as host_checking_fw_noclear
participant Resolver as find_scout_script
participant Filesystem as script tree
participant PXEServer as /public/scout-firmware-scripts
Handler->>Resolver: find_scout_script(pxe_public_base_url, vendor, model, component_type)
Resolver->>Filesystem: read upgrade.sh and metadata.toml
Filesystem-->>Resolver: script bytes and timeout metadata
Resolver->>Resolver: compute SHA256 and build PXE URL
Resolver-->>Handler: ScoutFirmwareScript
Handler->>PXEServer: use script URL and file artifact URLs
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (4)
crates/pxe/src/main.rs (2)
38-39: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAvoid hard-coding the Scout scripts root in PXE.
This duplicates the scripts-directory contract in code, so non-default
scout_firmware_scripts_directorydeployments can drift and break downloads. Prefer sourcing this path from runtime configuration/env for the PXE service too.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/pxe/src/main.rs` around lines 38 - 39, The hard-coded constant SCOUT_FIRMWARE_SCRIPTS_DIR duplicates the scripts directory configuration and prevents dynamic deployment configurations from being properly honored. Replace this constant with a configuration value or environment variable that is read at runtime, similar to how the scout_firmware_scripts_directory setting is handled in other parts of the codebase. This ensures that the PXE service can respect non-default scout_firmware_scripts_directory deployments without requiring code changes or recompilation.
87-91: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winFail fast when the Scout scripts directory is missing.
The route is mounted even if the directory is absent, which defers failure to runtime upgrade requests (404s). Add a startup existence/readability check for the mounted path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/pxe/src/main.rs` around lines 87 - 91, The ServeDir mounting for SCOUT_FIRMWARE_SCRIPTS_DIR does not validate that the directory exists or is readable before the route is registered, causing failures to occur later during runtime requests instead of at startup. Add a validation check before the .nest_service() call that verifies SCOUT_FIRMWARE_SCRIPTS_DIR exists and is readable, then panic or return an error to fail fast during application startup if the directory is missing or inaccessible.crates/firmware/src/tests/config.rs (1)
223-267: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep a legacy-script parsing regression case.
Line 267 now only validates the
script = Nonepath. Since this feature is explicitly backward-compatible, add a companion case that includes[components.cx7.known_firmware.scout.script]and asserts it still deserializes asSome(...).As per coding guidelines, “Prefer table-driven tests … Reach for a table whenever two or more tests call the same operation.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/firmware/src/tests/config.rs` around lines 223 - 267, The test cx7_component_config_parses_as_first_class_component currently only validates the case where scout.script is None, which doesn't test backward compatibility for scout script deserialization. Refactor this into a table-driven test with two cases: one without the [components.cx7.known_firmware.scout.script] section that asserts scout.script.is_none() (existing behavior), and a second case that includes the scout script configuration block and asserts that scout.script deserializes as Some(...) to ensure the legacy script parsing works correctly.Source: Coding guidelines
crates/machine-controller/src/scout_firmware_scripts.rs (1)
186-205: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winAdd table coverage for rejected path segments.
safe_lookup_keyis the boundary preventing inventory strings from selecting nested or traversal paths, but the tests do not exercise values like../dgxh100,dgx/h100, or space-containing model names. Add a small table so this invariant cannot regress. As per coding guidelines, prefer table-driven tests for functions that map inputs to outputs.Proposed test coverage
fn is_safe_path_segment(value: &str) -> bool { !value.is_empty() && value .chars() .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '-' | '_')) } + + #[test] + fn safe_lookup_key_normalizes_only_safe_single_segments() { + for (input, expected) in [ + ("DGXH100", Some("dgxh100")), + ("dgx-h100", Some("dgx-h100")), + ("dgx_h100", Some("dgx_h100")), + ("../dgxh100", None), + ("dgx/h100", None), + ("dgx h100", None), + ("", None), + ] { + assert_eq!(safe_lookup_key(input), expected.map(String::from), "{input:?}"); + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/scout_firmware_scripts.rs` around lines 186 - 205, The safe_lookup_key and is_safe_path_segment functions lack comprehensive test coverage for unsafe path segments, creating risk of regression on security boundaries that prevent traversal attacks. Add a table-driven test for safe_lookup_key that exercises both accepted cases (like "dgxh100", "dgx-h100") and rejected cases (like "../dgxh100", "dgx/h100", "dgx h100" with spaces) to ensure the invariant preventing nested or traversal paths cannot regress.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-controller/src/config/firmware_global.rs`:
- Around line 65-68: The scout_firmware_scripts_directory configuration field
allows runtime customization of where firmware scripts are located, but PXE
serves the scripts from a hardcoded SCOUT_FIRMWARE_SCRIPTS_DIR constant. This
creates a divergence where the controller's script resolver and PXE's static
serving can point to different directories. To fix this, ensure that PXE's
static serving route is configured to use the same
scout_firmware_scripts_directory value from the FirmwareGlobal configuration at
runtime, rather than relying on the hardcoded SCOUT_FIRMWARE_SCRIPTS_DIR
constant, so both the resolver and PXE serving always use the same source path.
In `@crates/machine-controller/src/config/mod.rs`:
- Line 37: The new required field pxe_public_base_url has been added to the
struct on line 37, but the struct literal in the test_default function (around
lines 56-67) is missing this field, causing the test build to fail. Locate the
struct initialization in test_default and add the pxe_public_base_url field with
an appropriate test value (such as a sample URL string) to match the String type
of the field.
In `@crates/machine-controller/src/handler.rs`:
- Around line 8217-8233: The to_pxe_url closure uses string-based strip_prefix
instead of path-aware prefix stripping, which causes incorrect behavior when
firmware_directory matches directory name prefixes (e.g., /opt/nico/firmware and
/opt/nico/firmware2). Replace the string-based strip_prefix call with
Path::strip_prefix to enforce semantic correctness for path manipulation. Change
the to_pxe_url closure signature to return Result<String, StateHandlerError>
instead of String, and return an error when the file path falls outside the
firmware_directory boundary instead of silently accepting it. Update the map
operation that calls to_pxe_url (around line 8252) to properly propagate the
error through the Result chain.
In `@crates/machine-controller/src/scout_firmware_scripts.rs`:
- Around line 79-96: Replace the `.exists()` calls with `.try_exists()` to
properly propagate filesystem errors instead of silently converting them to
false. Specifically, in the root.exists() check on line 79 and in the
script_path.exists() and metadata_path.exists() checks on line 94 within the
get_scout_firmware_script_root function, change these calls to use try_exists()
and apply the ? operator to propagate any Result errors. Wrap each try_exists()
call with wrap_err_with() to add contextual information about what filesystem
access was being attempted, enabling operators to distinguish between missing
scripts and inaccessible configured scripts.
---
Nitpick comments:
In `@crates/firmware/src/tests/config.rs`:
- Around line 223-267: The test
cx7_component_config_parses_as_first_class_component currently only validates
the case where scout.script is None, which doesn't test backward compatibility
for scout script deserialization. Refactor this into a table-driven test with
two cases: one without the [components.cx7.known_firmware.scout.script] section
that asserts scout.script.is_none() (existing behavior), and a second case that
includes the scout script configuration block and asserts that scout.script
deserializes as Some(...) to ensure the legacy script parsing works correctly.
In `@crates/machine-controller/src/scout_firmware_scripts.rs`:
- Around line 186-205: The safe_lookup_key and is_safe_path_segment functions
lack comprehensive test coverage for unsafe path segments, creating risk of
regression on security boundaries that prevent traversal attacks. Add a
table-driven test for safe_lookup_key that exercises both accepted cases (like
"dgxh100", "dgx-h100") and rejected cases (like "../dgxh100", "dgx/h100", "dgx
h100" with spaces) to ensure the invariant preventing nested or traversal paths
cannot regress.
In `@crates/pxe/src/main.rs`:
- Around line 38-39: The hard-coded constant SCOUT_FIRMWARE_SCRIPTS_DIR
duplicates the scripts directory configuration and prevents dynamic deployment
configurations from being properly honored. Replace this constant with a
configuration value or environment variable that is read at runtime, similar to
how the scout_firmware_scripts_directory setting is handled in other parts of
the codebase. This ensures that the PXE service can respect non-default
scout_firmware_scripts_directory deployments without requiring code changes or
recompilation.
- Around line 87-91: The ServeDir mounting for SCOUT_FIRMWARE_SCRIPTS_DIR does
not validate that the directory exists or is readable before the route is
registered, causing failures to occur later during runtime requests instead of
at startup. Add a validation check before the .nest_service() call that verifies
SCOUT_FIRMWARE_SCRIPTS_DIR exists and is readable, then panic or return an error
to fail fast during application startup if the directory is missing or
inaccessible.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 239bb055-80f8-4108-a94e-84a538cac14f
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (23)
crates/api-core/src/cfg/README.mdcrates/api-core/src/cfg/file.rscrates/api-core/src/cfg/test_data/full_config.tomlcrates/api-core/src/test_support/default_config.rscrates/api-model/src/firmware.rscrates/firmware/src/tests/config.rscrates/machine-controller/Cargo.tomlcrates/machine-controller/src/config/firmware_global.rscrates/machine-controller/src/config/mod.rscrates/machine-controller/src/handler.rscrates/machine-controller/src/lib.rscrates/machine-controller/src/scout_firmware_scripts.rscrates/pxe/src/main.rsdev/deployment/localdev/Dockerfile.api.localdevdev/deployment/localdev/Dockerfile.api.localdev.minikubedev/deployment/localdev/Dockerfile.pxe.localdevdev/deployment/localdev/Dockerfile.pxe.localdev.minikubedev/docker/Dockerfile.release-container-aarch64dev/docker/Dockerfile.release-container-sa-x86_64dev/docker/Dockerfile.release-container-x86_64pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.tomlpxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.shskaffold.yml
de5b23c to
ada9be4
Compare
There was a problem hiding this comment.
This is a tested script from here: https://gitlab-master.nvidia.com/nvmetal/forge-firmware/-/tree/main/nvidia/DGXH100/cx7/28.48.1000/scripts
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
|
@coderabbitai PTAL |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/machine-controller/src/handler.rs (1)
8235-8270: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winGate Scout task creation on local artifact availability.
This branch now serves firmware artifacts from PXE, but it bypasses the existing
self.downloader.available(...)deferral used by the Redfish path. If the script resolves before everyto_install.filesartifact has been downloaded intofirmware_directory, Scout receives PXE URLs that can 404 and fail the upgrade instead of waiting for the next reconciliation pass.Proposed fix
if let Some(scout_script) = scout_script { let firmware_dir = &ctx.services.site_config.firmware_global.firmware_directory; let pxe_public_base_url = ctx .services .site_config .pxe_public_base_url .trim_end_matches('/'); + + if let Some(artifact) = to_install.files.iter().find(|artifact| { + !self.downloader.available( + Path::new(&artifact.filename), + &artifact.url, + &artifact.sha256, + ) + }) { + tracing::debug!( + filename = %artifact.filename, + url = %artifact.url, + "Scout firmware artifact is still downloading" + ); + return Ok(StateHandlerOutcome::do_nothing()); + } let upgrade_task_id = uuid::Uuid::new_v4().to_string(); let file_artifact_count = to_install.files.len();As per path instructions, controller logic should be reviewed for reconciliation correctness, state-machine transitions, and safe recovery from partial failures.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 8235 - 8270, The Scout task creation block (starting with if let Some(scout_script)) does not check whether all firmware artifacts are available locally before proceeding, unlike the Redfish path which uses self.downloader.available(...) for deferral. Add a check to verify that all artifacts in to_install.files are available in the firmware directory by calling self.downloader.available(...) for each file before creating the ScoutFirmwareUpgradeTask. If any artifact is not yet available, return an error or defer task creation to the next reconciliation pass rather than allowing Scout to receive PXE URLs that may 404.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 184-204: The firmware_artifact_url function performs a lexical
check with Path::strip_prefix, which allows paths containing `..` components to
pass through without normalization, potentially enabling directory traversal
attacks. After obtaining the relative path from strip_prefix, validate that it
does not contain parent directory (..) components by iterating through path
components or checking the string representation and rejecting any paths that
include `..` as a component. This check should occur after the strip_prefix call
but before converting to a string and generating the final URL.
In `@pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.sh`:
- Around line 90-93: The printf statement on line 90 logs a mlxfwmanager command
without the `-f` flag, but the actual mlxfwmanager execution on line 93 includes
the `-f` flag. Update the printf statement to include the `-f` flag in the
logged command so that the output message accurately reflects the flags being
passed to mlxfwmanager during execution. This ensures the logged output matches
the actual command for debugging clarity.
---
Outside diff comments:
In `@crates/machine-controller/src/handler.rs`:
- Around line 8235-8270: The Scout task creation block (starting with if let
Some(scout_script)) does not check whether all firmware artifacts are available
locally before proceeding, unlike the Redfish path which uses
self.downloader.available(...) for deferral. Add a check to verify that all
artifacts in to_install.files are available in the firmware directory by calling
self.downloader.available(...) for each file before creating the
ScoutFirmwareUpgradeTask. If any artifact is not yet available, return an error
or defer task creation to the next reconciliation pass rather than allowing
Scout to receive PXE URLs that may 404.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eb16bbfd-6d79-43fd-a1ab-676c3769b939
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (24)
crates/api-core/src/cfg/README.mdcrates/api-core/src/cfg/file.rscrates/api-core/src/cfg/test_data/full_config.tomlcrates/api-core/src/test_support/default_config.rscrates/api-model/src/firmware.rscrates/firmware/src/tests/config.rscrates/machine-controller/Cargo.tomlcrates/machine-controller/src/config/mod.rscrates/machine-controller/src/handler.rscrates/machine-controller/src/lib.rscrates/machine-controller/src/scout_firmware_scripts.rscrates/pxe/Cargo.tomlcrates/pxe/src/main.rscrates/utils/src/lib.rsdev/deployment/localdev/Dockerfile.api.localdevdev/deployment/localdev/Dockerfile.api.localdev.minikubedev/deployment/localdev/Dockerfile.pxe.localdevdev/deployment/localdev/Dockerfile.pxe.localdev.minikubedev/docker/Dockerfile.release-container-aarch64dev/docker/Dockerfile.release-container-sa-x86_64dev/docker/Dockerfile.release-container-x86_64pxe/scout-firmware-scripts/nvidia/dgxh100/cx7/metadata.tomlpxe/scout-firmware-scripts/nvidia/dgxh100/cx7/upgrade.shskaffold.yml
ianderson-nvidia
left a comment
There was a problem hiding this comment.
I can't tell from the changes, but is this breaking the previous upgrade path (the metadata containing a URL to a script)?
| if let Some(to_install) = | ||
| need_host_fw_upgrade(&explored_endpoint, &fw_info, firmware_type) | ||
| { | ||
| if let Some(scout_config) = &to_install.scout { |
There was a problem hiding this comment.
@ianderson-nvidia this is the relevant bit to your question. It will not use the "old" config anymore, instead it will try to find a script from the static files.
ada9be4 to
cc7a06e
Compare
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2756.docs.buildwithfern.com/infra-controller |
This PR adds static Scout firmware upgrade scripts.
Instead of requiring firmware metadata to provide an arbitrary Scout script URL, NICo now resolves Scout upgrade scripts from checked-in static assets using the machine vendor, model, and firmware component type. The machine-controller reads the matching script and metadata, computes the script SHA-256, applies the configured per-script timeouts, and builds the Scout firmware upgrade task with a PXE-served script URL.
This will be especially needed when NICo moves to an API based approach for firmware metadata.
Suggested order of files to review:
Type of Change
Breaking Changes
Testing
Additional Notes
The API and PXE images both copy the Scout script assets into:
/opt/carbide/scout-firmware-scriptsPXE serves that directory from a fixed image path. The API/machine-controller reads from the configured local script directory so it can compute the script hash and read script metadata before constructing the Scout task.
skaffold.ymlsyncspxe/scout-firmware-scripts/**/*for local development.