Skip to content

fix: record workflow gRPC call latency for monitoring#2702

Open
rsd-darshan wants to merge 3 commits into
NVIDIA:mainfrom
rsd-darshan:fix/record-workflow-latency-metrics
Open

fix: record workflow gRPC call latency for monitoring#2702
rsd-darshan wants to merge 3 commits into
NVIDIA:mainfrom
rsd-darshan:fix/record-workflow-latency-metrics

Conversation

@rsd-darshan

@rsd-darshan rsd-darshan commented Jun 19, 2026

Copy link
Copy Markdown

Fixes #2649

Summary

Adds latency measurement and logging to gRPC calls in workflow activities. Workflow latency metrics were previously defined in site-agent's coregrpc manager but never actually recorded. This change instruments the activities to measure and log the duration of each gRPC call to NICo.

Changes

  • Introduces latency.go with a shared helper logGrpcCallLatency() to standardize duration logging across activities
  • Instruments vpcprefix.go activities (Create, Update, Delete) to measure gRPC call duration
  • Instruments instance.go UpdateInstanceConfig activity with latency measurement
  • Logs include operation name, duration, and error state for observability

Benefits

  • Visibility into API response times for troubleshooting performance issues
  • Foundation for future metric recording to Prometheus/telemetry systems
  • Enables monitoring of workflow SLA compliance and identifying slow operations

Testing

All existing tests pass. Verified that duration logging appears in test output with actual measured latencies.

@rsd-darshan rsd-darshan requested a review from a team as a code owner June 19, 2026 11:24
@copy-pr-bot

copy-pr-bot Bot commented Jun 19, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • Refactor
    • Improved operational observability by timing Core gRPC calls across site workflow CRUD actions and reporting call duration via centralized latency logging.
    • Standardized gRPC error handling by removing duplicate per-call warning logs and “completed” info messages while still surfacing failures.
  • Chores
    • Enhanced backend logging consistency across additional site workflow activities, including DPUs, expected machine/rack/switch/power shelf, InfiniBand partitions, OS images, SSH key groups, and maintenance-related machine health reports.

Walkthrough

A new logGrpcCallLatency helper is introduced that encapsulates latency instrumentation with conditional logging: warn-level output on failure (including error context) and debug-level output on success. This helper is then applied uniformly across 17 site-workflow activity files to time and log all Core gRPC calls, replacing scattered per-call logger.Warn failure logs and logger.Info("Completed activity") success logs with centralized duration measurement and reporting.

Changes

gRPC Call Latency Instrumentation

Layer / File(s) Summary
logGrpcCallLatency helper
rest-api/site-workflow/pkg/activity/latency.go
Introduces a small, reusable helper accepting a zerolog.Logger, operation name, time.Duration, and error; emits a warn-level log including the operation name, duration, and error on failure, and a debug-level log with operation name and duration on success.
Latency instrumentation applied across all activity call sites
rest-api/site-workflow/pkg/activity/dpuextensionservice.go, expectedmachine.go, expectedpowershelf.go, expectedrack.go, expectedswitch.go, ibpartition.go, instance.go, instancetype.go, machine.go, networksecuritygroup.go, nvlinklogicalpartition.go, operatingsystem.go, sshkeygroup.go, subnet.go, tenant.go, vpc.go, vpcpeering.go, vpcprefix.go
Every Core gRPC call site is wrapped with start := time.Now() and duration := time.Since(start) before invoking logGrpcCallLatency. The previous inline logger.Warn(...) on failure and logger.Info("Completed activity") on success are removed from all 17 files. Error paths continue to return swe.WrapErr(err) directly; success paths return existing result values or nil without explicit completion logging.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

low risk

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the primary change: adding latency recording for gRPC calls in workflow activities for monitoring purposes.
Description check ✅ Passed The description is well-structured and directly related to the changeset, providing context, scope, and rationale for the instrumentation work across multiple workflow activity files.
Linked Issues check ✅ Passed The PR successfully addresses issue #2649 by instrumenting all mutating Site Controller gRPC calls across 19 workflow activity files to record previously-defined but unused latency metrics.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #2649's objective to instrument gRPC calls for latency measurement; read-only operations and Flow gRPC calls are intentionally excluded per architectural decisions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

Adds latency measurement and logging to gRPC calls in workflow activities. Introduces a shared helper function to standardize how gRPC call durations are logged across activities, enabling better observability of workflow performance.

This addresses issue NVIDIA#2649 where workflow latency metrics were defined but not recorded. The duration is now captured and logged for each gRPC call, providing visibility into API response times for troubleshooting and performance monitoring.

Fixes NVIDIA#2649

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>
@rsd-darshan rsd-darshan force-pushed the fix/record-workflow-latency-metrics branch from b4fbd86 to 0ae41ea Compare June 19, 2026 11:30
@ajf

ajf commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

@rsd-darshan I'm only seeing one file getting changed. Intentional?

Extends the logGrpcCallLatency helper to every mutating Site Controller
gRPC call (Create/Update/Delete/Add/Remove/Insert) across the workflow
activities, not just the two introduced earlier. Each call now measures
its duration and logs it alongside the operation name and error state.

Read-only inventory lookups (Find/Get/Search) are left untouched since
they're invoked through a different, generic code path used by the
inventory collection helpers. Calls that go to the Flow gRPC client
(AddComponent, CreateExpectedRackOnFlow) are also left untouched since
they target a different downstream service.

Signed-off-by: rsd-darshan <poudeldarshan44@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
rest-api/site-workflow/pkg/activity/tenant.go (2)

54-62: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Prefer explicit return nil for consistency with other activities.

After successful operations, line 62 returns err (which is nil at that point) instead of explicitly returning nil. While functionally correct, this pattern is inconsistent with all other activity files in this PR and the broader codebase, which use explicit return nil on success paths.

♻️ Proposed change for consistency
 	start := time.Now()
 	_, err = grpcServiceClient.CreateTenant(ctx, request)
 	duration := time.Since(start)
 	logGrpcCallLatency(&logger, "CreateTenant", duration, err)
 	if err != nil {
 		return swe.WrapErr(err)
 	}
 
-	return err
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/tenant.go` around lines 54 - 62, In the
CreateTenant function, the final return statement at line 62 returns the err
variable (which is nil at that point) instead of explicitly returning nil.
Replace the return statement after the error handling block with an explicit
return nil to maintain consistency with other activity files in the codebase and
follow the established pattern for successful operation completion.

91-99: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Prefer explicit return nil for consistency with other activities.

After successful operations, line 99 returns err (which is nil at that point) instead of explicitly returning nil. While functionally correct, this pattern is inconsistent with all other activity files in this PR and the broader codebase, which use explicit return nil on success paths.

♻️ Proposed change for consistency
 	start := time.Now()
 	_, err = grpcServiceClient.UpdateTenant(ctx, request)
 	duration := time.Since(start)
 	logGrpcCallLatency(&logger, "UpdateTenant", duration, err)
 	if err != nil {
 		return swe.WrapErr(err)
 	}
 
-	return err
+	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/tenant.go` around lines 91 - 99, In the
UpdateTenant activity function, change the final return statement from `return
err` to `return nil` on line 99. After the successful UpdateTenant operation
when no error has occurred, explicitly return nil instead of returning the err
variable (which is nil at that point) to maintain consistency with other
activity files in the codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rest-api/site-workflow/pkg/activity/latency.go`:
- Around line 15-17: The successful operation latency logging in the else block
is currently using logger.Debug() which will not appear in typical production
log configurations, preventing real-time SLA monitoring and latency analysis.
Change the log level from Debug to Info in the else block where the duration and
"Completed %s" operation message are logged to ensure these latency measurements
are visible in production environments and enable proper SLA compliance
monitoring.

---

Nitpick comments:
In `@rest-api/site-workflow/pkg/activity/tenant.go`:
- Around line 54-62: In the CreateTenant function, the final return statement at
line 62 returns the err variable (which is nil at that point) instead of
explicitly returning nil. Replace the return statement after the error handling
block with an explicit return nil to maintain consistency with other activity
files in the codebase and follow the established pattern for successful
operation completion.
- Around line 91-99: In the UpdateTenant activity function, change the final
return statement from `return err` to `return nil` on line 99. After the
successful UpdateTenant operation when no error has occurred, explicitly return
nil instead of returning the err variable (which is nil at that point) to
maintain consistency with other activity files in the codebase.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c52a219-69d7-415a-9ef0-bec7d3ff658b

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and 8427f7c.

📒 Files selected for processing (19)
  • rest-api/site-workflow/pkg/activity/dpuextensionservice.go
  • rest-api/site-workflow/pkg/activity/expectedmachine.go
  • rest-api/site-workflow/pkg/activity/expectedpowershelf.go
  • rest-api/site-workflow/pkg/activity/expectedrack.go
  • rest-api/site-workflow/pkg/activity/expectedswitch.go
  • rest-api/site-workflow/pkg/activity/ibpartition.go
  • rest-api/site-workflow/pkg/activity/instance.go
  • rest-api/site-workflow/pkg/activity/instancetype.go
  • rest-api/site-workflow/pkg/activity/latency.go
  • rest-api/site-workflow/pkg/activity/machine.go
  • rest-api/site-workflow/pkg/activity/networksecuritygroup.go
  • rest-api/site-workflow/pkg/activity/nvlinklogicalpartition.go
  • rest-api/site-workflow/pkg/activity/operatingsystem.go
  • rest-api/site-workflow/pkg/activity/sshkeygroup.go
  • rest-api/site-workflow/pkg/activity/subnet.go
  • rest-api/site-workflow/pkg/activity/tenant.go
  • rest-api/site-workflow/pkg/activity/vpc.go
  • rest-api/site-workflow/pkg/activity/vpcpeering.go
  • rest-api/site-workflow/pkg/activity/vpcprefix.go

Comment on lines +15 to +17
} else {
logger.Debug().Dur("duration", duration).Msgf("Completed %s", operation)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Success logging at Debug level prevents production SLA monitoring.

The PR objectives state this change "enables monitoring of workflow SLA compliance and identification of slow operations." However, logging successful operations at Debug level means these latency measurements won't appear in typical production log configurations (which run at Info or Warn level). This prevents real-time SLA monitoring and latency analysis for the success path, which is critical for performance observability.

Consider changing line 16 to log at Info level to ensure successful operation latencies are visible in production:

📊 Proposed fix to enable production monitoring
 func logGrpcCallLatency(logger *zerolog.Logger, operation string, duration time.Duration, err error) {
 	if err != nil {
 		logger.Warn().Err(err).Dur("duration", duration).Msgf("Failed to %s using Site Controller API", operation)
 	} else {
-		logger.Debug().Dur("duration", duration).Msgf("Completed %s", operation)
+		logger.Info().Dur("duration", duration).Msgf("Completed %s", operation)
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/latency.go` around lines 15 - 17, The
successful operation latency logging in the else block is currently using
logger.Debug() which will not appear in typical production log configurations,
preventing real-time SLA monitoring and latency analysis. Change the log level
from Debug to Info in the else block where the duration and "Completed %s"
operation message are logged to ensure these latency measurements are visible in
production environments and enable proper SLA compliance monitoring.

@rsd-darshan

Copy link
Copy Markdown
Author

@ajf Good catch, that was incomplete. Pushed an update that wires the logGrpcCallLatency helper into every mutating Site Controller gRPC call (Create/Update/Delete/Add/Remove/Insert) across the workflow activities — 19 files now.

Read-only lookups (Find/Get/Search) and calls bound for the Flow gRPC client (e.g. AddComponent) are intentionally left out since they go through a different code path / downstream service.

@ianderson-nvidia

ianderson-nvidia commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@ianderson-nvidia: Understood. I will perform a full review of the PR, covering all 19 instrumented activity files and the new latency.go helper in their updated state.

✅ Action performed

Full review finished.

@ianderson-nvidia

Copy link
Copy Markdown
Contributor

/ok to test 8427f7c

@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

No Grype artifacts were found to aggregate.

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-23 23:05:46 UTC | Commit: 8427f7c

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/activity/latency.go (1)

12-17: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured operation fields instead of Msgf interpolation in latency logs.

Line 14 and Line 16 currently embed operation inside the log message, which is harder to query and aggregate than explicit fields.

♻️ Proposed refactor
 func logGrpcCallLatency(logger *zerolog.Logger, operation string, duration time.Duration, err error) {
 	if err != nil {
-		logger.Warn().Err(err).Dur("duration", duration).Msgf("Failed to %s using Site Controller API", operation)
+		logger.Warn().
+			Err(err).
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call failed")
 	} else {
-		logger.Debug().Dur("duration", duration).Msgf("Completed %s", operation)
+		logger.Debug().
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call completed")
 	}
 }

As per coding guidelines, "When writing log messages, prefer placing common fields as attributes passed to tracing function, instead of using string interpolation."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/latency.go` around lines 12 - 17, In the
logGrpcCallLatency function, the operation parameter is currently embedded in
log messages using Msgf string interpolation on both the error path (Warn log)
and success path (Debug log). Instead, add operation as a structured field by
chaining .Str("operation", operation) before calling .Msg() with a static
message string. Replace both the Msgf calls with Msg calls that use the
structured operation field, which will improve log queryability and aggregation.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rest-api/site-workflow/pkg/activity/expectedswitch.go`:
- Around line 286-289: The logGrpcCallLatency helper function in
rest-api/site-workflow/pkg/activity/latency.go logs successful gRPC calls at
Debug level, which suppresses them in production deployments running at Info or
Warn level. Modify the logGrpcCallLatency function to log success cases at Info
level instead of Debug level, and use structured fields for consistency with
logging conventions. This single change will fix all 40+ call sites across the
activity package (including the AddExpectedSwitch call and others in
expectedswitch, vpc, tenant, subnet, instance, machine, and other files) by
ensuring successful latency records are visible in production for proper
observability and SLA visibility.

In `@rest-api/site-workflow/pkg/activity/networksecuritygroup.go`:
- Around line 59-62: The logGrpcCallLatency function in latency.go is logging
success-case latency metrics at Debug level, which makes them invisible at the
default Info log level used in production. Locate the logGrpcCallLatency
function definition in latency.go and change the logging level for successful
gRPC call completions from Debug to Info level, while keeping error logs at
their current level if appropriate. This will ensure that operation latency
metrics are visible in production logs without requiring debug logging to be
explicitly enabled. The function is called from three locations (around lines
59-62, 99-102, and 139-142 in networksecuritygroup.go), but you only need to fix
it once in the function definition itself.

---

Nitpick comments:
In `@rest-api/site-workflow/pkg/activity/latency.go`:
- Around line 12-17: In the logGrpcCallLatency function, the operation parameter
is currently embedded in log messages using Msgf string interpolation on both
the error path (Warn log) and success path (Debug log). Instead, add operation
as a structured field by chaining .Str("operation", operation) before calling
.Msg() with a static message string. Replace both the Msgf calls with Msg calls
that use the structured operation field, which will improve log queryability and
aggregation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 874a244a-c631-4712-8460-907c9dfbf798

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and 8427f7c.

📒 Files selected for processing (19)
  • rest-api/site-workflow/pkg/activity/dpuextensionservice.go
  • rest-api/site-workflow/pkg/activity/expectedmachine.go
  • rest-api/site-workflow/pkg/activity/expectedpowershelf.go
  • rest-api/site-workflow/pkg/activity/expectedrack.go
  • rest-api/site-workflow/pkg/activity/expectedswitch.go
  • rest-api/site-workflow/pkg/activity/ibpartition.go
  • rest-api/site-workflow/pkg/activity/instance.go
  • rest-api/site-workflow/pkg/activity/instancetype.go
  • rest-api/site-workflow/pkg/activity/latency.go
  • rest-api/site-workflow/pkg/activity/machine.go
  • rest-api/site-workflow/pkg/activity/networksecuritygroup.go
  • rest-api/site-workflow/pkg/activity/nvlinklogicalpartition.go
  • rest-api/site-workflow/pkg/activity/operatingsystem.go
  • rest-api/site-workflow/pkg/activity/sshkeygroup.go
  • rest-api/site-workflow/pkg/activity/subnet.go
  • rest-api/site-workflow/pkg/activity/tenant.go
  • rest-api/site-workflow/pkg/activity/vpc.go
  • rest-api/site-workflow/pkg/activity/vpcpeering.go
  • rest-api/site-workflow/pkg/activity/vpcprefix.go

Comment on lines +286 to +289
start := time.Now()
_, err = grpcServiceClient.AddExpectedSwitch(ctx, request)
duration := time.Since(start)
logGrpcCallLatency(&logger, "AddExpectedSwitch", duration, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify how site-workflow log level is configured/defaulted.
# Expected: if default/effective level is Info or higher, Debug success-latency logs are not emitted.
rg -n --type go -C3 'SetGlobalLevel|ParseLevel|WithLevel|LOG_LEVEL|zerolog' rest-api/site-workflow rest-api/common

# Show all current helper call sites touched by this PR cohort.
rg -n --type go 'logGrpcCallLatency\(' rest-api/site-workflow/pkg/activity

Repository: NVIDIA/infra-controller

Length of output: 47687


🏁 Script executed:

# Read the latency.go file to see the actual log level used in success path
sed -n '12,20p' rest-api/site-workflow/pkg/activity/latency.go

Repository: NVIDIA/infra-controller

Length of output: 480


Successful gRPC latency logs are suppressed in production deployments.

The logGrpcCallLatency helper in rest-api/site-workflow/pkg/activity/latency.go logs success cases at Debug level. In production deployments running at Info or Warn (standard configurations), these successful latency records are discarded, undermining observability and SLA visibility for normal traffic. Error cases remain visible at Warn level, creating an asymmetry where failures are instrumented but successes are not.

The fix elevates success logging to Info and uses structured fields for consistency with logging conventions:

Suggested change in `rest-api/site-workflow/pkg/activity/latency.go`
 func logGrpcCallLatency(logger *zerolog.Logger, operation string, duration time.Duration, err error) {
 	if err != nil {
-		logger.Warn().Err(err).Dur("duration", duration).Msgf("Failed to %s using Site Controller API", operation)
+		logger.Warn().
+			Err(err).
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call failed")
 	} else {
-		logger.Debug().Dur("duration", duration).Msgf("Completed %s", operation)
+		logger.Info().
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call completed")
 	}
 }

This affects 40+ call sites across rest-api/site-workflow/pkg/activity/*.go (expectedswitch, vpc, tenant, subnet, instance, machine, and others).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
start := time.Now()
_, err = grpcServiceClient.AddExpectedSwitch(ctx, request)
duration := time.Since(start)
logGrpcCallLatency(&logger, "AddExpectedSwitch", duration, err)
func logGrpcCallLatency(logger *zerolog.Logger, operation string, duration time.Duration, err error) {
if err != nil {
logger.Warn().
Err(err).
Str("operation", operation).
Dur("duration", duration).
Msg("Site Controller API call failed")
} else {
logger.Info().
Str("operation", operation).
Dur("duration", duration).
Msg("Site Controller API call completed")
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/expectedswitch.go` around lines 286 -
289, The logGrpcCallLatency helper function in
rest-api/site-workflow/pkg/activity/latency.go logs successful gRPC calls at
Debug level, which suppresses them in production deployments running at Info or
Warn level. Modify the logGrpcCallLatency function to log success cases at Info
level instead of Debug level, and use structured fields for consistency with
logging conventions. This single change will fix all 40+ call sites across the
activity package (including the AddExpectedSwitch call and others in
expectedswitch, vpc, tenant, subnet, instance, machine, and other files) by
ensuring successful latency records are visible in production for proper
observability and SLA visibility.

Comment on lines +59 to +62
start := time.Now()
_, err = rpcServiceClient.CreateNetworkSecurityGroup(ctx, request)
duration := time.Since(start)
logGrpcCallLatency(&logger, "CreateNetworkSecurityGroup", duration, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate latency helper usage and definition
rg -n "func logGrpcCallLatency|logGrpcCallLatency\(" rest-api/site-workflow/pkg/activity

# 2) Inspect configured log levels in site-workflow and shared logging/config code
rg -n "SetGlobalLevel|GlobalLevel|LOG_LEVEL|log_level|zerolog\\.(DebugLevel|InfoLevel|WarnLevel)" \
  rest-api/site-workflow rest-api/common

# 3) Find deployment/runtime env defaults for log level (if any)
rg -n "LOG_LEVEL|log-level|log_level" rest-api/deploy rest-api/helm rest-api/docker

Repository: NVIDIA/infra-controller

Length of output: 7842


🏁 Script executed:

# Inspect the latency helper implementation
cat -n rest-api/site-workflow/pkg/activity/latency.go

# Search for log level initialization in site-workflow
rg -n "zerolog|SetGlobalLevel|logger\.SetLevel" rest-api/site-workflow/pkg --max-depth=5

# Check main.go or similar entrypoint for log setup
find rest-api/site-workflow -name "main.go" -o -name "init.go" | head -5

Repository: NVIDIA/infra-controller

Length of output: 6104


🏁 Script executed:

# Find site-workflow main entry point
find rest-api/site-workflow -name "main.go" -o -name "cmd" -type d

# Check what's in the cmd or main area
ls -la rest-api/site-workflow/cmd/ 2>/dev/null || ls -la rest-api/site-workflow/ | grep -E "main|cmd"

# Search for log initialization in common package
rg -n "log\.|Logger|zerolog" rest-api/common/pkg/config --max-depth=3 -A 3

# Check site-workflow main directly
find rest-api/site-workflow -maxdepth 3 -name "*.go" -path "*/cmd/*" -o -name "main.go" | xargs head -50

Repository: NVIDIA/infra-controller

Length of output: 161


🏁 Script executed:

# Find where site-workflow package is imported/used
rg -n "site-workflow|site_workflow" rest-api --glob="*.go" | grep -v "test" | head -20

# Check the main site-workflow directory structure
ls -la rest-api/site-workflow/

# Search for logging setup in workflow initialization
rg -n "log\.Level|SetLevel|GlobalLevel" rest-api/workflow rest-api/site-workflow -A 2

# Check Helm/Kustomize for LOG_LEVEL env var defaults
rg -n "LOG_LEVEL|log.level" rest-api/helm rest-api/deploy -i

Repository: NVIDIA/infra-controller

Length of output: 3813


🏁 Script executed:

# Check workflow's log level configuration
sed -n '250,280p' rest-api/workflow/internal/config/config.go

# Look for log level default in workflow config
rg -n "InfoLevel|DebugLevel|log.*level|LOG_LEVEL" rest-api/workflow/internal/config -B 2 -A 2

# Check site-workflow package to see if it sets up logging
rg -n "SetGlobalLevel|log\.Level|DEBUG|INFO" rest-api/site-workflow/pkg -i | head -20

# Check how site-agent (which runs site-workflow) initializes logging
rg -n "log\.|zerolog" rest-api/site-agent/cmd rest-api/site-agent/internal -i | grep -i "level\|init\|setup" | head -10

Repository: NVIDIA/infra-controller

Length of output: 5375


🏁 Script executed:

# Look for default log level in config defaults
rg -n "log\.level|ConfigLogLevel" rest-api/workflow/internal/config -B 5 -A 5 | head -50

# Check viper defaults or config files
rg -n "log.*level|info|debug" rest-api/common/pkg/config --glob="*.go" -i

# Check if there's a default in GetLogLevel
rg -n "GetLogLevel" rest-api/workflow/internal/config -A 5

# Look for environment defaults
rg -n "log.*level.*info|log.*level.*debug|LOG_LEVEL.*=" rest-api -i | head -20

Repository: NVIDIA/infra-controller

Length of output: 5665


Success-latency logs for gRPC calls now log at Debug level, making them invisible at the default Info log level.

Lines 59–62, 99–102, and 139–142 route completion latency through logGrpcCallLatency, which logs success at Debug level (latency.go:16). The workflow service defaults to log.level: "info" (workflow/internal/config/config.go:129), so these latency metrics will not appear in production logs unless debug logging is explicitly enabled—a configuration most deployments do not use.

Promote success-latency logs to Info level to preserve runtime visibility of operation performance, or document that debug logging must be enabled for latency observability.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/networksecuritygroup.go` around lines 59
- 62, The logGrpcCallLatency function in latency.go is logging success-case
latency metrics at Debug level, which makes them invisible at the default Info
log level used in production. Locate the logGrpcCallLatency function definition
in latency.go and change the logging level for successful gRPC call completions
from Debug to Info level, while keeping error logs at their current level if
appropriate. This will ensure that operation latency metrics are visible in
production logs without requiring debug logging to be explicitly enabled. The
function is called from three locations (around lines 59-62, 99-102, and 139-142
in networksecuritygroup.go), but you only need to fix it once in the function
definition itself.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/activity/latency.go (1)

12-17: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use structured log fields instead of formatted messages in the latency helper.

Msgf embeds operation context in free text, which makes querying less reliable. Emit operation as a structured field and use Msg(...) for both branches.

♻️ Proposed refinement
 func logGrpcCallLatency(logger *zerolog.Logger, operation string, duration time.Duration, err error) {
 	if err != nil {
-		logger.Warn().Err(err).Dur("duration", duration).Msgf("Failed to %s using Site Controller API", operation)
+		logger.Warn().
+			Err(err).
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call failed")
 	} else {
-		logger.Debug().Dur("duration", duration).Msgf("Completed %s", operation)
+		logger.Debug().
+			Str("operation", operation).
+			Dur("duration", duration).
+			Msg("Site Controller API call completed")
 	}
 }

As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing function, instead of using string interpolation.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/activity/latency.go` around lines 12 - 17, The
logGrpcCallLatency function currently embeds the operation context in free-text
messages using Msgf, which reduces log query reliability. Refactor this function
to emit operation as a structured field instead. Add a `.Str("operation",
operation)` field to both the error and success log entries, then replace both
Msgf calls with Msg calls using plain text messages that do not interpolate the
operation variable. This applies to both branches of the if-else statement.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@rest-api/site-workflow/pkg/activity/latency.go`:
- Around line 12-17: The logGrpcCallLatency function currently embeds the
operation context in free-text messages using Msgf, which reduces log query
reliability. Refactor this function to emit operation as a structured field
instead. Add a `.Str("operation", operation)` field to both the error and
success log entries, then replace both Msgf calls with Msg calls using plain
text messages that do not interpolate the operation variable. This applies to
both branches of the if-else statement.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d8113186-8387-4296-8a0e-c024bb35d427

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and 8427f7c.

📒 Files selected for processing (19)
  • rest-api/site-workflow/pkg/activity/dpuextensionservice.go
  • rest-api/site-workflow/pkg/activity/expectedmachine.go
  • rest-api/site-workflow/pkg/activity/expectedpowershelf.go
  • rest-api/site-workflow/pkg/activity/expectedrack.go
  • rest-api/site-workflow/pkg/activity/expectedswitch.go
  • rest-api/site-workflow/pkg/activity/ibpartition.go
  • rest-api/site-workflow/pkg/activity/instance.go
  • rest-api/site-workflow/pkg/activity/instancetype.go
  • rest-api/site-workflow/pkg/activity/latency.go
  • rest-api/site-workflow/pkg/activity/machine.go
  • rest-api/site-workflow/pkg/activity/networksecuritygroup.go
  • rest-api/site-workflow/pkg/activity/nvlinklogicalpartition.go
  • rest-api/site-workflow/pkg/activity/operatingsystem.go
  • rest-api/site-workflow/pkg/activity/sshkeygroup.go
  • rest-api/site-workflow/pkg/activity/subnet.go
  • rest-api/site-workflow/pkg/activity/tenant.go
  • rest-api/site-workflow/pkg/activity/vpc.go
  • rest-api/site-workflow/pkg/activity/vpcpeering.go
  • rest-api/site-workflow/pkg/activity/vpcprefix.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: site-agent defines but does not record workflow metrics

3 participants