Flush telemetry in a detached subprocess #275

Merged
carole-lavillonniere merged 1 commit into main from flc-648-investigate-lstk-aws-overhead-vs-awslocal on Jun 2, 2026

@carole-lavillonniere carole-lavillonniere commented Jun 2, 2026

Motivation

lstk aws took ~2x as long as awslocal for simple calls (~1.6s vs ~0.75s). The overhead came from telemetry.Client.Close() blocking on HTTP POSTs to the analytics endpoint before process exit (~900ms, confirmed by setting LOCALSTACK_DISABLE_EVENTS=1). As discussed on the ticket, the flush now happens in a detached subprocess after the CLI returns, so analytics latency no longer affects command responsiveness.

Changes

  • telemetry.Client buffers events in memory; Close() hands them to a detached subprocess and returns immediately instead of draining HTTP requests
    • The buffer is capped at 64 events (pendingCap) as a safety bound — a single CLI run emits only 1–3 events in practice. On overflow the oldest event is dropped, keeping the most valuable one: the final lstk_command event emitted after the command completes. (The previous channel-based implementation had the same cap but dropped the newest event on overflow.)
  • New __flush-telemetry mode reads JSON-line events from stdin and POSTs them, bounded by a 10s wall-clock cap so orphaned flushers can't linger
    • It is short-circuited at the top of Execute(), before logger/keyring/telemetry/cobra init, so the flusher writes nothing to the filesystem (no lstk.log and no config dir creation, which raced with test temp-home cleanup on macOS)
  • Platform-specific detach: Setsid on Unix, windows.DETACHED_PROCESS | windows.CREATE_NEW_PROCESS_GROUP (via golang.org/x/sys) on Windows
  • Trace context is propagated to the flusher via TRACEPARENT/TRACESTATE env vars, so its telemetry flush and telemetry POST spans join the originating command's trace
  • The flusher never constructs a telemetry client or cobra command, so it structurally cannot emit events about itself or spawn another flusher recursively

Tests

  • TestStartCommand_DoesNotBlockOnSlowAnalyticsEndpoint: parent exits in <1s while the endpoint hangs 3s, and the event is still delivered
  • TestCommandTelemetryIsDeliveredByDetachedFlusher: Docker-free end-to-end flush — also runs on Windows CI, covering the Windows detach flags (the Docker-based telemetry tests skip there)
  • TestFlushTelemetrySubcommandDoesNotSpawnRecursively: guards against recursive flusher spawning
  • TestOtelFlushSpansJoinCommandTrace: flush spans from the subprocess reach the OTLP collector
  • Unit tests for event buffering/handoff (including cap/drop-oldest behavior) and the TRACEPARENT round-trip (trace/span ID equality)
  • Manually verified in Jaeger: telemetry flush appears as CHILD_OF the command span in a single trace; lstk aws sts get-caller-identity-style commands no longer wait on analytics

Closes FLC-648

Tracing

The trace below shows that the telemetry event is sent after the main command's process has returned:
[Image: trace screenshot]

@carole-lavillonniere changed the title from "Flush telemetry in a detached subprocess to keep commands fast" to "Flush telemetry in a detached subprocess" on Jun 2, 2026
@carole-lavillonniere force-pushed the flc-648-investigate-lstk-aws-overhead-vs-awslocal branch 2 times, most recently from ce3259b to d49b5c2, on June 2, 2026 09:50
@carole-lavillonniere force-pushed the flc-648-investigate-lstk-aws-overhead-vs-awslocal branch from d49b5c2 to 38b6d38 on June 2, 2026 10:11
@carole-lavillonniere marked this pull request as ready for review on June 2, 2026 10:26
@anisaoshafi anisaoshafi left a comment

Thanks for tackling this and the thorough testing and PR description🥇

@carole-lavillonniere carole-lavillonniere merged commit efa6f19 into main Jun 2, 2026
12 checks passed
@carole-lavillonniere carole-lavillonniere deleted the flc-648-investigate-lstk-aws-overhead-vs-awslocal branch June 2, 2026 11:45