Skip to content

fix: more changes to support lenovo GB300s#2809

Open
krish-nvidia wants to merge 2 commits into
NVIDIA:mainfrom
krish-nvidia:krish/followup-gb300-changes
Open

fix: more changes to support lenovo GB300s#2809
krish-nvidia wants to merge 2 commits into
NVIDIA:mainfrom
krish-nvidia:krish/followup-gb300-changes

Conversation

@krish-nvidia

Copy link
Copy Markdown
Contributor

This PR introduces more changes to enable exploration and provisioning of Lenovo GB300 trays (AMI host BMC) in NICo, and fixes a compile break from the nv-redfish 0.10.3 bump.

Changes

  • Password rotation (site-explorer): rotate the AMI/GB300 admin account by username instead of the hardcoded /Accounts/2 (GB300's admin id is 1/4, so the old PATCH 404'd). Falls back to id 2 only when account reads are blocked by PasswordChangeRequired (Viking factory state); any other error propagates.
  • UEFI device-path parsing (api-model): un-anchor PCI_ROOT_REGEX so GB300/Grace boot paths that prefix PciRoot with VenHw(...)/MemoryMapped(...) parse instead of aborting the whole exploration.
  • Build fix (bmc-mock): implement the new RequestError trait (added by nv-redfish 0.10.3) for the AxumRouterHttpClient test error so ServiceRoot<TestBmc> compiles.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia self-assigned this Jun 23, 2026
@krish-nvidia krish-nvidia requested a review from a team as a code owner June 23, 2026 19:22
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • Bug Fixes
    • Improved device-path parsing so PCI Root segments are detected correctly within vendor/MMIO-prefixed paths (better support for Grace/GB300-style paths).
    • Updated BMC password change behavior to attempt password change using the resolved current admin username first, then fall back to rotation when required.
    • Enhanced managed-host DPU detection in the no-PCIe-discovery fallback to derive BlueField identity from chassis fields (including chassis serial) rather than relying on network-adapter serials.
  • Tests
    • Added an integration regression test covering DPU pairing using chassis serial.
  • Chores
    • Updated the nv-redfish dependency to version 0.10.3.

Walkthrough

Bumps the nv-redfish workspace dependency from 0.10.0 to 0.10.3, extends the bmc-mock test HTTP client with a RejectedUriReference error variant wired to RejectedUriReferenceError, removes the ^ start anchor from PCI_ROOT_REGEX to handle Grace/GB300 device paths, replaces the unconditional change_password_by_id("2") call with a try-then-fallback strategy in set_bmc_root_password, and adds BlueField DPU detection from chassis fields when the PCIe enumeration fallback path discovers no DPUs.

Changes

nv-redfish upgrade, error wiring, regex fix, password fallback, and chassis-based BlueField detection

Layer / File(s) Summary
nv-redfish bump and mock client RejectedUriReference support
Cargo.toml, crates/bmc-mock/src/test_support/axum_http_client.rs
Bumps nv-redfish to 0.10.3 and extends the mock axum HTTP client Error enum with a RejectedUriReference(String) variant, a corresponding Display arm, and an impl RequestError for Error that converts RejectedUriReferenceError instances into that variant.
PCI_ROOT_REGEX: remove start anchor for Grace/GB300 paths
crates/api-model/src/site_explorer/mod.rs
Removes the ^ anchor from PCI_ROOT_REGEX so that PciRoot(...) is matched at any position within a UefiDevicePath string, accommodating vendor/MMIO prefix nodes present in Grace/GB300-style paths.
set_bmc_root_password: try change_password first, fall back on PasswordChangeRequired
crates/site-explorer/src/redfish.rs
Replaces the unconditional change_password_by_id("2") call in the AMI | LenovoGB300 branch with a match that first attempts change_password(curr_user, ...), falls back to change_password_by_id("2") only on PasswordChangeRequired, and applies double redaction of both new and current passwords before mapping any other error.
BlueField DPU detection from chassis fields in PCIe fallback
crates/site-explorer/src/lib.rs, crates/site-explorer/tests/site_explorer.rs
When no DPUs are discovered via PCIe scan, the identify_managed_hosts fallback path now detects BlueField DPUs where part_number and a non-empty serial_number are present on the chassis object itself, recording the DPU with chassis-sourced identity rather than relying solely on network_adapters. A regression test verifies pairing via chassis serial when PCIe and network-adapter serials are absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2753: Directly modifies the same set_bmc_root_password function and the same AMI | LenovoGB300 branch in crates/site-explorer/src/redfish.rs, extending the change_password_by_id("2") path that this PR now replaces with a try-then-fallback strategy.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the core objective of the PR—delivering multiple fixes and enhancements to support Lenovo GB300 exploration and provisioning.
Description check ✅ Passed The description is well-structured, detailing the three principal changes (password rotation, UEFI parsing, and build fixes) with clear rationale for each modification.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
crates/api-model/src/site_explorer/mod.rs (1)

1368-1371: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Consider validating the single-PciRoot constraint.

The unanchored regex now matches the first occurrence of PciRoot(...) if multiple exist in the path. While unlikely with compliant firmware, malformed paths could contain multiple PciRoot nodes, leading to silent incorrect parsing.

Adding validation that exactly one PciRoot node exists would strengthen the parser and provide clearer error messages for malformed input. This could be achieved by checking that PCI_ROOT_REGEX.captures_iter(st).count() == 1 before proceeding.

🛡️ Proposed validation enhancement
 let mut push_group = |group: &str| -> Result<(), String> {
     // ... existing code ...
 };

+// Validate exactly one PciRoot exists
+let pci_root_count = PCI_ROOT_REGEX.captures_iter(st).count();
+if pci_root_count != 1 {
+    return Err(format!(
+        "Expected exactly one PciRoot node, found {} in: {s}",
+        pci_root_count
+    ));
+}
+
 let root = PCI_ROOT_REGEX
     .captures(st)
     .and_then(|c| c.get(1))
     .ok_or_else(|| format!("Could not match regex in PCI Device Path {s}."))?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-model/src/site_explorer/mod.rs` around lines 1368 - 1371, The
current code using PCI_ROOT_REGEX.captures() only matches the first PciRoot
occurrence, which could silently accept malformed paths with multiple PciRoot
nodes. Add validation before the captures call to ensure exactly one PciRoot
node exists by checking that PCI_ROOT_REGEX.captures_iter(st).count() == 1,
returning a clear error message if the count is not exactly 1. This strengthens
the parser to reject malformed input instead of silently parsing only the first
match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-model/src/site_explorer/mod.rs`:
- Around line 1335-1341: The `PCI_ROOT_REGEX` modification in the site_explorer
module lacks test coverage for the parsing behavior change that now accommodates
GB300/Grace device paths with vendor/MMIO prefixes. Add table-driven tests using
carbide-test-support in a tests module that verify the following scenarios:
GB300/Grace paths with VenHw and MemoryMapped prefixes (e.g.,
`VenHw(...)/MemoryMapped(...)/PciRoot(0x16)/Pci(0x0,0x0)`), original unadorned
paths (e.g., `PciRoot(0x8)/Pci(0x2,0xa)/Pci(0x0,0x0)`), paths with optional MAC
suffixes, and invalid paths (missing Pci nodes, malformed PciRoot, no PciRoot
nodes). This ensures the regex change functions correctly and prevents
regressions in the hardware discovery path.

In `@crates/site-explorer/src/redfish.rs`:
- Around line 247-250: The PasswordChangeRequired error handler is applying an
AMI-specific fallback using hardcoded account ID "2" to all vendors, including
LenovoGB300, but GB300 admin accounts are at IDs "1" or "4", not "2". In the
Err(libredfish::RedfishError::PasswordChangeRequired) branch, add
vendor-specific logic to check whether the client is LenovoGB300 or another
vendor before applying the account ID "2" fallback. For LenovoGB300, either
return a clear error indicating the account cannot be resolved safely, or use
GB300-specific account IDs if a validated fallback path exists. This ensures the
fallback only applies to vendors where account "2" is appropriate and prevents
attempts to change the wrong account.

---

Nitpick comments:
In `@crates/api-model/src/site_explorer/mod.rs`:
- Around line 1368-1371: The current code using PCI_ROOT_REGEX.captures() only
matches the first PciRoot occurrence, which could silently accept malformed
paths with multiple PciRoot nodes. Add validation before the captures call to
ensure exactly one PciRoot node exists by checking that
PCI_ROOT_REGEX.captures_iter(st).count() == 1, returning a clear error message
if the count is not exactly 1. This strengthens the parser to reject malformed
input instead of silently parsing only the first match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b81cf9ab-8659-48cf-858e-ad912853dfd2

📥 Commits

Reviewing files that changed from the base of the PR and between 89c0b75 and 1d09f25.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • Cargo.toml
  • crates/api-model/src/site_explorer/mod.rs
  • crates/bmc-mock/src/test_support/axum_http_client.rs
  • crates/site-explorer/src/redfish.rs

Comment on lines +1335 to +1341
// Not anchored at start: GB300/Grace UEFI device paths prefix the PciRoot
// node with vendor/MMIO nodes, e.g.
// VenHw(<guid>)/MemoryMapped(0xB,...)/PciRoot(0x16)/Pci(0x0,0x0)/Pci(0x0,0x0)
// An `^PciRoot` anchor never matches those and aborts the whole exploration
// (`Could not match regex in PCI Device Path`). Match PciRoot wherever it appears.
static ref PCI_ROOT_REGEX: Regex =
Regex::new(r"^PciRoot\(([^)]*)\)").expect("must always compile");
Regex::new(r"PciRoot\(([^)]*)\)").expect("must always compile");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Add comprehensive test coverage for the regex parsing change.

This modification to PCI_ROOT_REGEX alters parsing behavior to accommodate GB300/Grace device paths with vendor/MMIO prefixes, yet no tests were added to verify correctness. According to the style guide, parsing functions are ideal candidates for table-driven tests using carbide-test-support.

The following scenarios should be covered:

  • GB300/Grace paths with vendor/MMIO prefixes: VenHw(...)/MemoryMapped(...)/PciRoot(0x16)/Pci(0x0,0x0)
  • Original paths without prefixes: PciRoot(0x8)/Pci(0x2,0xa)/Pci(0x0,0x0)
  • Paths with optional /MAC(...) suffix
  • Invalid paths: missing Pci nodes, malformed PciRoot, multiple PciRoot nodes

Without test coverage, we cannot verify the fix works as intended or prevent regressions in this critical hardware discovery path. As per path instructions and STYLE_GUIDE.md testing guidance, add table-driven tests before merging.

📋 Example test structure using carbide-test-support
#[cfg(test)]
mod tests {
    use super::*;
    use carbide_test_support::{scenarios, Outcome::*};

    #[test]
    fn parse_uefi_device_path() {
        scenarios!(UefiDevicePath::from_str:
            "GB300/Grace paths with vendor/MMIO prefix" {
                "VenHw(D3987D4B-971A-435F-8CAF-4967EB627241)/MemoryMapped(0xB,0x3FFBFE00000,0x3FFBFEFFFFF)/PciRoot(0x16)/Pci(0x0,0x0)" 
                    => Yields(UefiDevicePath("22.0.0".into())),
                "VenHw(...)/PciRoot(0x8)/Pci(0x2,0xa)/Pci(0x0,0x0)/MAC(...)" 
                    => Yields(UefiDevicePath("8.2.10.0.0".into())),
            }

            "original paths without prefix" {
                "PciRoot(0x8)/Pci(0x2,0xa)/Pci(0x0,0x0)" 
                    => Yields(UefiDevicePath("8.2.10.0.0".into())),
                "PciRoot(0x7)/Pci(0x0,0x0)" 
                    => Yields(UefiDevicePath("7.0.0".into())),
            }

            "invalid paths" {
                "PciRoot(0x8)" => Fails,  // missing Pci nodes
                "Pci(0x0,0x0)" => Fails,  // missing PciRoot
                "" => Fails,
            }
        );
    }
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-model/src/site_explorer/mod.rs` around lines 1335 - 1341, The
`PCI_ROOT_REGEX` modification in the site_explorer module lacks test coverage
for the parsing behavior change that now accommodates GB300/Grace device paths
with vendor/MMIO prefixes. Add table-driven tests using carbide-test-support in
a tests module that verify the following scenarios: GB300/Grace paths with VenHw
and MemoryMapped prefixes (e.g.,
`VenHw(...)/MemoryMapped(...)/PciRoot(0x16)/Pci(0x0,0x0)`), original unadorned
paths (e.g., `PciRoot(0x8)/Pci(0x2,0xa)/Pci(0x0,0x0)`), paths with optional MAC
suffixes, and invalid paths (missing Pci nodes, malformed PciRoot, no PciRoot
nodes). This ensures the regex change functions correctly and prevents
regressions in the hardware discovery path.

Sources: Coding guidelines, Path instructions

Comment on lines +247 to +250
Err(libredfish::RedfishError::PasswordChangeRequired) => {
client
.change_password_by_id("2", new_password.as_str())
.await

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not apply the AMI /Accounts/2 fallback to LenovoGB300.

This branch includes RedfishVendor::LenovoGB300, but the fallback still PATCHes account ID "2" when account reads are blocked. The PR objective says GB300 admin accounts are at ID 1 or 4, so this can either preserve the GB300 404 failure or rotate the wrong account while the caller proceeds with curr_user/new_password.

Split the PasswordChangeRequired fallback by vendor and use only a validated GB300-specific fallback path, or return a clear error for GB300 when the account cannot be resolved safely.

As per path instructions, BMC-facing code should account for “credential handling” and “vendor differences.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/redfish.rs` around lines 247 - 250, The
PasswordChangeRequired error handler is applying an AMI-specific fallback using
hardcoded account ID "2" to all vendors, including LenovoGB300, but GB300 admin
accounts are at IDs "1" or "4", not "2". In the
Err(libredfish::RedfishError::PasswordChangeRequired) branch, add
vendor-specific logic to check whether the client is LenovoGB300 or another
vendor before applying the account ID "2" fallback. For LenovoGB300, either
return a clear error indicating the account cannot be resolved safely, or use
GB300-specific account IDs if a validated fallback path exists. This ensures the
fallback only applies to vendors where account "2" is appropriate and prevents
attempts to change the wrong account.

Source: Path instructions

@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 24 98 6 130
machine-validation-runner 712 32 184 267 35 194
machine_validation 712 32 184 267 35 194
nvmetal-carbide 712 32 184 267 35 194
TOTAL 2406 102 576 905 111 712

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
@krish-nvidia krish-nvidia requested a review from poroh June 23, 2026 21:39

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/site-explorer/src/lib.rs`:
- Around line 1214-1226: The condition on line 1214 that checks
`chassis_is_bluefield && chassis_has_serial` is too permissive and causes
adapter serials to be skipped even when the chassis serial cannot actually pair.
Modify the condition to only take the chassis path (calling
record_host_dpu_device and skipping the adapter iteration in the else block) if
the chassis serial actually exists as a key in the dpu_sn_to_endpoint map.
Currently it skips adapters whenever chassis_has_serial is true, but it should
only skip them when the chassis serial can actually be paired based on what
exists in dpu_sn_to_endpoint. Check if chassis.serial_number.as_deref() exists
in the dpu_sn_to_endpoint map before deciding to take the chassis path and skip
the adapter iteration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 705c09d0-6d8d-4c21-a397-4ca39e73db41

📥 Commits

Reviewing files that changed from the base of the PR and between 1d09f25 and 4c7373b.

📒 Files selected for processing (2)
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs

Comment on lines +1214 to +1226
if chassis_is_bluefield && chassis_has_serial {
self.record_host_dpu_device(
network_adapter.part_number.as_deref(),
network_adapter.serial_number.as_deref(),
chassis.part_number.as_deref(),
chassis.serial_number.as_deref(),
&dpu_sn_to_endpoint,
host_dpu_mode,
&ep,
&mut dpu_exploration,
metrics,
)
.await;
} else {
for network_adapter in chassis.network_adapters.iter() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Do not skip adapter serials unless the chassis serial can actually pair.

Line 1214 selects the chassis path for any non-empty BlueField chassis serial, but record_host_dpu_device only pairs if that serial exists in dpu_sn_to_endpoint. If the chassis serial is an enclosure/card serial and the nested adapter has the DPU system serial, this skips the adapter and prevents pairing.

Proposed fix
-                    let chassis_has_serial = chassis
+                    let chassis_serial = chassis
                         .serial_number
                         .as_deref()
                         .map(str::trim)
-                        .is_some_and(|serial| !serial.is_empty());
-                    if chassis_is_bluefield && chassis_has_serial {
+                        .filter(|serial| !serial.is_empty());
+                    let chassis_serial_matches_discovered_dpu = chassis_serial
+                        .is_some_and(|serial| dpu_sn_to_endpoint.contains_key(serial));
+                    if chassis_is_bluefield && chassis_serial_matches_discovered_dpu {
                         self.record_host_dpu_device(
                             chassis.part_number.as_deref(),
-                            chassis.serial_number.as_deref(),
+                            chassis_serial,
                             &dpu_sn_to_endpoint,
                             host_dpu_mode,
                             &ep,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 1214 - 1226, The condition on
line 1214 that checks `chassis_is_bluefield && chassis_has_serial` is too
permissive and causes adapter serials to be skipped even when the chassis serial
cannot actually pair. Modify the condition to only take the chassis path
(calling record_host_dpu_device and skipping the adapter iteration in the else
block) if the chassis serial actually exists as a key in the dpu_sn_to_endpoint
map. Currently it skips adapters whenever chassis_has_serial is true, but it
should only skip them when the chassis serial can actually be paired based on
what exists in dpu_sn_to_endpoint. Check if chassis.serial_number.as_deref()
exists in the dpu_sn_to_endpoint map before deciding to take the chassis path
and skip the adapter iteration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants