Skip to content

fix(health): report BMC intrusion events#2782

Open
osu wants to merge 2 commits into
NVIDIA:mainfrom
osu:fix/2671-bmc-intrusion-health-report
Open

fix(health): report BMC intrusion events#2782
osu wants to merge 2 commits into
NVIDIA:mainfrom
osu:fix/2671-bmc-intrusion-health-report

Conversation

@osu

@osu osu commented Jun 23, 2026

Copy link
Copy Markdown
Member

Summary

  • detect BMC log events for physical chassis intrusion and emit IntrusionSensorTriggered machine health reports
  • add hardware-health.bmc-events source plus PreventAllocations/SensorCritical classifications for the generated alert
  • enable hardware-health BMC log collection in Helm and deploy manifests so the detector receives intrusion events

Validation

  • PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health intrusion --lib
  • PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health sink::events --lib
  • PATH="$HOME/.rustup/toolchains/1.96.0-aarch64-apple-darwin/bin:$PATH" cargo test -p carbide-health --lib
  • rustup run nightly-2026-06-16 rustfmt --check crates/health/src/processor/intrusion_events.rs crates/health/src/processor/mod.rs crates/health/src/lib.rs crates/health/src/sink/events.rs
  • git diff --check
  • helm lint helm/charts/nico-hardware-health
  • helm template nico-hardware-health helm/charts/nico-hardware-health
  • kubectl kustomize deploy/nico-base/hardware-health

Backend follow-up for #2671; complements #2772, which surfaces the resulting alert details on /admin/managed-host.

Signed-off-by: Hasan Khan <hasank@nvidia.com>
@osu osu requested review from a team as code owners June 23, 2026 04:34
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 66466943-4f13-490d-ac4d-6e341c09442f

📥 Commits

Reviewing files that changed from the base of the PR and between a6864dd and bd8e0e9.

📒 Files selected for processing (1)
  • crates/health/src/processor/intrusion_events.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/health/src/processor/intrusion_events.rs

Summary by CodeRabbit

  • New Features
    • Added chassis intrusion detection and health reporting. The system now detects physical intrusion events and generates health alerts when triggered, with automatic clearing when the intrusion threat is resolved.

Walkthrough

A new BmcIntrusionEventProcessor is introduced to translate BMC log records into health report alerts or successes by keyword-matching against the log body and optional message_args JSON. Three new enum variants are added to the health event sink. The processor is wired into the pipeline, and the logs collector is enabled in deployment configuration.

Changes

BMC Intrusion Health Reporting

Layer / File(s) Summary
Health event sink enum extensions
crates/health/src/sink/events.rs
Adds ReportSource::BmcEvents, Probe::IntrusionSensorTriggered, and Classification::PreventAllocations variants with as_str() mappings and corresponding unit test scenarios for string conversion and alert conversion output.
BmcIntrusionEventProcessor implementation and tests
crates/health/src/processor/intrusion_events.rs
Defines IntrusionEventState enum and keyword-matching constants, implements helper methods to parse message_args JSON, implements EventProcessor trait producing HealthReport with alert or success based on keyword classification, and provides comprehensive unit tests covering intrusion alert detection (via body and message_args), clear detection, and unrelated log filtering.
Processor module wiring and pipeline registration
crates/health/src/processor/mod.rs, crates/health/src/lib.rs
Declares the intrusion_events submodule, re-exports BmcIntrusionEventProcessor, imports it in lib.rs, and conditionally pushes it into the processor pipeline when the health_report sink is enabled.
Deployment configuration enabling logs collector
deploy/nico-base/hardware-health/deployment.yaml, helm/charts/nico-hardware-health/values.yaml
Sets NICO_HEALTH__COLLECTORS__LOGS__ENABLED to "true" in the Kubernetes deployment manifest and the Helm chart values file.

Sequence Diagram(s)

sequenceDiagram
    participant LogCollector
    participant BmcIntrusionEventProcessor
    participant HealthReportSink

    LogCollector->>BmcIntrusionEventProcessor: CollectorEvent::Log(record)
    BmcIntrusionEventProcessor->>BmcIntrusionEventProcessor: Extract body + message_args
    BmcIntrusionEventProcessor->>BmcIntrusionEventProcessor: intrusion_event_state() keyword match

    alt Alert keywords matched
        BmcIntrusionEventProcessor->>HealthReportSink: HealthReport { alert: IntrusionSensorTriggered, PreventAllocations }
    else Clear keywords matched
        BmcIntrusionEventProcessor->>HealthReportSink: HealthReport { success: IntrusionSensorTriggered }
    else No intrusion keywords
        BmcIntrusionEventProcessor-->>LogCollector: Empty output (no report)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/infra-controller#2772: Directly references the same IntrusionSensorTriggered probe and HostBMC target introduced in this PR, asserting the intrusion alert details are rendered on the managed-host page.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 39.13% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely and accurately describes the primary change: adding BMC intrusion event reporting to the health system.
Description check ✅ Passed The description provides relevant context about the changeset, including detection logic, classifications, deployment configuration, and validation steps performed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/health/src/processor/intrusion_events.rs (1)

201-261: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consolidate intrusion classification tests into a table-driven suite.

These tests call the same operation with varied inputs; converting to value_scenarios! will reduce duplication and make edge-case additions (including ambiguous “normal + critical” phrasing) straightforward.

As per coding guidelines, “Prefer table-driven tests… Reach for a table whenever two or more tests call the same operation with different inputs.” As per path instructions, “crates/**/*.rs: … Prefer findings about behavior, concurrency, resource lifetimes, and missing tests over style-only comments.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/processor/intrusion_events.rs` around lines 201 - 261, The
tests emits_alert_for_physical_chassis_intrusion_log_body,
emits_alert_for_intrusion_message_args,
emits_success_for_cleared_intrusion_event, and
emits_success_for_reset_intrusion_event all call the same operation
emitted_report with different log inputs and verify different aspects of the
result. Consolidate these four test functions into a single table-driven test
using the value_scenarios! macro, where each row in the table represents one
test case with its input parameters and expected assertions. This will reduce
duplication and make it easier to add new edge cases in the future.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/src/processor/intrusion_events.rs`:
- Around line 76-93: The intrusion event classification logic checks for clear
keywords before alert keywords, causing messages containing both
"normal"/"reset" and alert signals like "critical" to be misclassified as Clear.
Reorder the conditional blocks in the intrusion event state detection logic so
that the alert check (which includes keywords like "physical chassis intrusion
alert", "trigger", "triggered", "assert", "alert", "critical", "warning") runs
before the clear check (which includes "clear", "cleared", "deassert", "normal",
"reset"). This ensures that more specific alert signals take priority and
prevents broad keywords like "normal" from causing false clears.

---

Nitpick comments:
In `@crates/health/src/processor/intrusion_events.rs`:
- Around line 201-261: The tests
emits_alert_for_physical_chassis_intrusion_log_body,
emits_alert_for_intrusion_message_args,
emits_success_for_cleared_intrusion_event, and
emits_success_for_reset_intrusion_event all call the same operation
emitted_report with different log inputs and verify different aspects of the
result. Consolidate these four test functions into a single table-driven test
using the value_scenarios! macro, where each row in the table represents one
test case with its input parameters and expected assertions. This will reduce
duplication and make it easier to add new edge cases in the future.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b077eb08-ae43-41a1-9373-34b63ac741d6

📥 Commits

Reviewing files that changed from the base of the PR and between 86dd05c and a6864dd.

📒 Files selected for processing (6)
  • crates/health/src/lib.rs
  • crates/health/src/processor/intrusion_events.rs
  • crates/health/src/processor/mod.rs
  • crates/health/src/sink/events.rs
  • deploy/nico-base/hardware-health/deployment.yaml
  • helm/charts/nico-hardware-health/values.yaml

Comment thread crates/health/src/processor/intrusion_events.rs Outdated
@osu osu marked this pull request as draft June 23, 2026 05:10
@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@osu osu self-assigned this Jun 23, 2026
Signed-off-by: Hasan Khan <hasank@nvidia.com>
@osu osu marked this pull request as ready for review June 23, 2026 05:15
@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 264 6 23 99 6 130
machine-validation-runner 714 34 183 266 35 196
machine_validation 714 34 183 266 35 196
nvmetal-carbide 714 34 183 266 35 196
TOTAL 2412 108 572 903 111 718

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants