fix(health): report BMC intrusion events by osu · Pull Request #2782 · NVIDIA/infra-controller

osu · 2026-06-23T04:34:06Z

Summary

detect BMC log events for physical chassis intrusion and emit IntrusionSensorTriggered machine health reports
add hardware-health.bmc-events source plus PreventAllocations/SensorCritical classifications for the generated alert
enable hardware-health BMC log collection in Helm and deploy manifests so the detector receives intrusion events

Validation

PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health intrusion --lib
PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health sink::events --lib
PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health --lib
rustup run nightly-2026-06-16 rustfmt --check crates/health/src/processor/intrusion_events.rs crates/health/src/processor/mod.rs crates/health/src/lib.rs crates/health/src/sink/events.rs
git diff --check
helm lint helm/charts/nico-hardware-health
helm template nico-hardware-health helm/charts/nico-hardware-health
kubectl kustomize deploy/nico-base/hardware-health

Backend follow-up for #2671; complements #2772, which surfaces the resulting alert details on /admin/managed-host.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai · 2026-06-23T04:34:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66466943-4f13-490d-ac4d-6e341c09442f

📥 Commits

Reviewing files that changed from the base of the PR and between a6864dd and bd8e0e9.

📒 Files selected for processing (1)

crates/health/src/processor/intrusion_events.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/health/src/processor/intrusion_events.rs

Summary by CodeRabbit

New Features
- Added chassis intrusion detection and health reporting. The system now detects physical intrusion events and generates health alerts when triggered, with automatic clearing when the intrusion threat is resolved.

Walkthrough

A new BmcIntrusionEventProcessor is introduced to translate BMC log records into health report alerts or successes by keyword-matching against the log body and optional message_args JSON. Three new enum variants are added to the health event sink. The processor is wired into the pipeline, and the logs collector is enabled in deployment configuration.

Changes

BMC Intrusion Health Reporting

Layer / File(s)	Summary
Health event sink enum extensions `crates/health/src/sink/events.rs`	Adds `ReportSource::BmcEvents`, `Probe::IntrusionSensorTriggered`, and `Classification::PreventAllocations` variants with `as_str()` mappings and corresponding unit test scenarios for string conversion and alert conversion output.
BmcIntrusionEventProcessor implementation and tests `crates/health/src/processor/intrusion_events.rs`	Defines `IntrusionEventState` enum and keyword-matching constants, implements helper methods to parse `message_args` JSON, implements `EventProcessor` trait producing `HealthReport` with alert or success based on keyword classification, and provides comprehensive unit tests covering intrusion alert detection (via body and message_args), clear detection, and unrelated log filtering.
Processor module wiring and pipeline registration `crates/health/src/processor/mod.rs`, `crates/health/src/lib.rs`	Declares the `intrusion_events` submodule, re-exports `BmcIntrusionEventProcessor`, imports it in `lib.rs`, and conditionally pushes it into the processor pipeline when the `health_report` sink is enabled.
Deployment configuration enabling logs collector `deploy/nico-base/hardware-health/deployment.yaml`, `helm/charts/nico-hardware-health/values.yaml`	Sets `NICO_HEALTH__COLLECTORS__LOGS__ENABLED` to `"true"` in the Kubernetes deployment manifest and the Helm chart values file.

Sequence Diagram(s)

sequenceDiagram
    participant LogCollector
    participant BmcIntrusionEventProcessor
    participant HealthReportSink

    LogCollector->>BmcIntrusionEventProcessor: CollectorEvent::Log(record)
    BmcIntrusionEventProcessor->>BmcIntrusionEventProcessor: Extract body + message_args
    BmcIntrusionEventProcessor->>BmcIntrusionEventProcessor: intrusion_event_state() keyword match

    alt Alert keywords matched
        BmcIntrusionEventProcessor->>HealthReportSink: HealthReport { alert: IntrusionSensorTriggered, PreventAllocations }
    else Clear keywords matched
        BmcIntrusionEventProcessor->>HealthReportSink: HealthReport { success: IntrusionSensorTriggered }
    else No intrusion keywords
        BmcIntrusionEventProcessor-->>LogCollector: Empty output (no report)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/infra-controller#2772: Directly references the same IntrusionSensorTriggered probe and HostBMC target introduced in this PR, asserting the intrusion alert details are rendered on the managed-host page.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 39.13% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title concisely and accurately describes the primary change: adding BMC intrusion event reporting to the health system.
Description check	✅ Passed	The description provides relevant context about the changeset, including detection logic, classifications, deployment configuration, and validation steps performed.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/health/src/processor/intrusion_events.rs (1)
201-261: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consolidate intrusion classification tests into a table-driven suite.

These tests call the same operation with varied inputs; converting to value_scenarios! will reduce duplication and make edge-case additions (including ambiguous “normal + critical” phrasing) straightforward.

As per coding guidelines, “Prefer table-driven tests… Reach for a table whenever two or more tests call the same operation with different inputs.” As per path instructions, “crates/**/*.rs: … Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/processor/intrusion_events.rs` around lines 201 - 261, The
tests emits_alert_for_physical_chassis_intrusion_log_body,
emits_alert_for_intrusion_message_args,
emits_success_for_cleared_intrusion_event, and
emits_success_for_reset_intrusion_event all call the same operation
emitted_report with different log inputs and verify different aspects of the
result. Consolidate these four test functions into a single table-driven test
using the value_scenarios! macro, where each row in the table represents one
test case with its input parameters and expected assertions. This will reduce
duplication and make it easier to add new edge cases in the future.
Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/src/processor/intrusion_events.rs`:
- Around line 76-93: The intrusion event classification logic checks for clear
keywords before alert keywords, causing messages containing both
"normal"/"reset" and alert signals like "critical" to be misclassified as Clear.
Reorder the conditional blocks in the intrusion event state detection logic so
that the alert check (which includes keywords like "physical chassis intrusion
alert", "trigger", "triggered", "assert", "alert", "critical", "warning") runs
before the clear check (which includes "clear", "cleared", "deassert", "normal",
"reset"). This ensures that more specific alert signals take priority and
prevents broad keywords like "normal" from causing false clears.

---

Nitpick comments:
In `@crates/health/src/processor/intrusion_events.rs`:
- Around line 201-261: The tests
emits_alert_for_physical_chassis_intrusion_log_body,
emits_alert_for_intrusion_message_args,
emits_success_for_cleared_intrusion_event, and
emits_success_for_reset_intrusion_event all call the same operation
emitted_report with different log inputs and verify different aspects of the
result. Consolidate these four test functions into a single table-driven test
using the value_scenarios! macro, where each row in the table represents one
test case with its input parameters and expected assertions. This will reduce
duplication and make it easier to add new edge cases in the future.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b077eb08-ae43-41a1-9373-34b63ac741d6

📥 Commits

Reviewing files that changed from the base of the PR and between 86dd05c and a6864dd.

📒 Files selected for processing (6)

crates/health/src/lib.rs
crates/health/src/processor/intrusion_events.rs
crates/health/src/processor/mod.rs
crates/health/src/sink/events.rs
deploy/nico-base/hardware-health/deployment.yaml
helm/charts/nico-hardware-health/values.yaml

copy-pr-bot · 2026-06-23T05:10:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

github-actions · 2026-06-23T05:18:30Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	264	6	23	99	6	130
machine-validation-runner	714	34	183	266	35	196
machine_validation	714	34	183	266	35	196
nvmetal-carbide	714	34	183	266	35	196
TOTAL	2412	108	572	903	111	718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

fix(health): report BMC intrusion events

a6864dd

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu requested review from a team as code owners June 23, 2026 04:34

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread crates/health/src/processor/intrusion_events.rs Outdated

osu marked this pull request as draft June 23, 2026 05:10

osu self-assigned this Jun 23, 2026

fix(health): address intrusion event review

bd8e0e9

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu marked this pull request as ready for review June 23, 2026 05:15

Matthias247 assigned yoks Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(health): report BMC intrusion events#2782

fix(health): report BMC intrusion events#2782
osu wants to merge 2 commits into
NVIDIA:mainfrom
osu:fix/2671-bmc-intrusion-health-report

osu commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

osu commented Jun 23, 2026

Summary

Validation

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

copy-pr-bot Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Container Scan Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading