Flush telemetry in a detached subprocess #275
Merged
carole-lavillonniere merged 1 commit on Jun 2, 2026
ce3259b to d49b5c2
d49b5c2 to 38b6d38
anisaoshafi (Collaborator) approved these changes on Jun 2, 2026:
Thanks for tackling this, and for the thorough testing and PR description 🥇
Motivation
`lstk aws` took ~2x as long as `awslocal` for simple calls (~1.6s vs ~0.75s). The overhead came from `telemetry.Client.Close()` blocking on HTTP POSTs to the analytics endpoint before process exit (~900ms, confirmed by rerunning with `LOCALSTACK_DISABLE_EVENTS=1`). As discussed on the ticket, the flush now happens in a detached subprocess after the CLI returns, so analytics latency no longer affects command responsiveness.
Changes
- `telemetry.Client` buffers events in memory; `Close()` hands them to a detached subprocess and returns immediately instead of draining HTTP requests.
- The buffer is capped (`pendingCap`) as a safety bound; a single CLI run emits only 1–3 events in practice. On overflow the oldest event is dropped, keeping the most valuable one: the final `lstk_command` event emitted after the command completes. (The previous channel-based implementation had the same cap but dropped the newest event on overflow.)
- A `__flush-telemetry` mode reads JSON-line events from stdin and POSTs them, bounded by a 10s wall-clock cap so orphaned flushers can't linger.
- This mode is handled early in `Execute()`, before logger/keyring/telemetry/cobra init, so the flusher writes nothing to the filesystem: no `lstk.log` and no config dir creation (which raced with test temp-home cleanup on macOS).
- The subprocess is detached with `Setsid` on Unix and with `windows.DETACHED_PROCESS | windows.CREATE_NEW_PROCESS_GROUP` (via `golang.org/x/sys`) on Windows.
- The flusher receives the command's trace context through the `TRACEPARENT`/`TRACESTATE` env vars, so its `telemetry flush` and `telemetry POST` spans join the originating command's trace.
Tests
- `TestStartCommand_DoesNotBlockOnSlowAnalyticsEndpoint`: the parent exits in <1s while the endpoint hangs for 3s, and the event is still delivered.
- `TestCommandTelemetryIsDeliveredByDetachedFlusher`: Docker-free end-to-end flush. It also runs on Windows CI, covering the Windows detach flags (the Docker-based telemetry tests skip there).
- `TestFlushTelemetrySubcommandDoesNotSpawnRecursively`: guards against recursive flusher spawning.
- `TestOtelFlushSpansJoinCommandTrace`: flush spans from the subprocess reach the OTLP collector.
- `TRACEPARENT` round-trip (trace/span ID equality).
- `telemetry flush` appears as `CHILD_OF` the command span in a single trace; `lstk aws sts get-caller-identity`-style commands no longer wait on analytics.

Closes FLC-648
Tracing
The trace clearly shows that the telemetry event is emitted after the main command's process has returned:
