gl-signerproxy: Fix concurrent HSM request blocking via message-passing dispatcher #727
Draft
cdecker wants to merge 2 commits into
Type 18 (WIRE_HSMD_GET_PER_COMMITMENT_POINT) is used by onchaind to derive historical commitment keys during force-close resolution. If the signer doesn't respond it blocks block processing just like the signing types already in the list. Observed on several deeply-lagged nodes where onchaind fires for a closing channel and gets stuck on a GET_PER_COMMITMENT_POINT request before the 10-minute bgsync preemption kicks in. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…patcher A single current_thread Tokio runtime was shared across all OS threads via Arc<Runtime>. When onchaind called block_on() for a type 143 request and blocked forever (no signer connected), no other thread could enter block_on, so type 27 init requests and all subsequent HSM calls were serialised behind the stuck type 143. This prevented type 143 from ever reaching the plugin stager, so stuck_request_types() returned empty and the bgsync session ran the full 10-minute timeout instead of aborting early. Fix: keep one runtime, remove direct block_on from handler threads. Add an mpsc channel to a grpc_dispatcher async task that spawns each gRPC call as an independent tokio::spawn'd task. Handler threads block on oneshot::blocking_recv for their own response; a permanently-stuck type 143 task cannot delay any other task. Adds a regression test that confirms a type 143 lockup does not block a concurrent type 27 request on a different connection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
gl-signerproxy shared a single current_thread Tokio runtime across all OS handler threads via Arc<Runtime>. current_thread only allows one concurrent block_on caller. When onchaind's thread called block_on(server.request(...)) for a type 143 (SIGN_ANY_REMOTE_HTLC_TO_US) request and blocked indefinitely (no signer connected), every other handler thread was serialised behind it:
- the type 143 request never reached stager.requests
- stuck_request_types() always returned empty
Fix
Keep a single Tokio runtime but remove direct block_on calls from handler threads. A GrpcMessage enum carries either a ping or a forwarded HSM request, each with a oneshot::Sender for the reply. An mpsc channel feeds a dedicated async grpc_dispatcher that tokio::spawns each message as an independent task. Handler threads call blocking_send / blocking_recv, so requests are handled fully concurrently: a permanently stuck type 143 task can no longer delay any other task.
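The dispatch pattern described above can be sketched with the standard library alone. This is an analogy, not the PR's code: OS threads and std::sync::mpsc stand in for Tokio tasks, tokio::sync::mpsc and oneshot, and the names fake_upstream and forward, as well as the fields of GrpcMessage, are assumptions made for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Illustrative stand-in for the PR's `GrpcMessage`: each variant carries its
/// own reply channel (a std channel used once, in place of `oneshot`).
enum GrpcMessage {
    Ping { reply: mpsc::Sender<()> },
    Request { msg_type: u16, reply: mpsc::Sender<Vec<u8>> },
}

/// Hypothetical upstream call: type 143 hangs for a long time, every other
/// type answers immediately with its type number as two big-endian bytes.
fn fake_upstream(msg_type: u16) -> Vec<u8> {
    if msg_type == 143 {
        thread::sleep(Duration::from_secs(60));
    }
    msg_type.to_be_bytes().to_vec()
}

/// Dispatcher loop: it never performs the call itself. Each message is handed
/// to an independent worker, so a stuck message cannot delay the next one.
fn dispatcher(rx: mpsc::Receiver<GrpcMessage>) {
    for msg in rx {
        thread::spawn(move || match msg {
            GrpcMessage::Ping { reply } => {
                let _ = reply.send(());
            }
            GrpcMessage::Request { msg_type, reply } => {
                let _ = reply.send(fake_upstream(msg_type));
            }
        });
    }
}

/// What a handler thread does: enqueue the request, then wait only on the
/// receiver for its own reply.
fn forward(tx: &mpsc::Sender<GrpcMessage>, msg_type: u16) -> mpsc::Receiver<Vec<u8>> {
    let (reply, rx) = mpsc::channel();
    tx.send(GrpcMessage::Request { msg_type, reply })
        .expect("dispatcher has shut down");
    rx
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || dispatcher(rx));

    // A type 143 request that will not be answered during this run.
    let _stuck = forward(&tx, 143);

    // A ping and a type 27 request still complete promptly.
    let started = Instant::now();
    let (ping_reply, ping_rx) = mpsc::channel();
    tx.send(GrpcMessage::Ping { reply: ping_reply }).unwrap();
    ping_rx.recv().expect("no ping reply");

    let reply = forward(&tx, 27).recv().expect("no reply for type 27");
    assert_eq!(reply, vec![0, 27]);
    assert!(started.elapsed() < Duration::from_secs(5));
    println!("type 27 answered in {:?}", started.elapsed());
}
```

The property that matters is that the dispatcher only routes messages. Because the slow call runs in its own worker, the queue keeps draining even while one request is stuck.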
With this change, type 143 requests reach the plugin stager, stuck_request_types() detects them, and the bgsync session aborts within 10 seconds.
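To illustrate why detection depends on the request actually reaching the stager, here is a minimal sketch of a stager that tracks in-flight requests. Everything in it is hypothetical: the Stager and Pending types, the BLOCKING_TYPES list, and the now and min_age parameters are assumptions for illustration, and the plugin's real stuck_request_types() may be structured quite differently.

```rust
use std::collections::{BTreeSet, HashMap};
use std::time::{Duration, Instant};

/// Hypothetical list of request types whose missing response stalls block
/// processing. 18 and 143 come from the PR text; the real list may differ.
const BLOCKING_TYPES: &[u16] = &[18, 143];

/// One in-flight request as the sketch stager sees it.
struct Pending {
    msg_type: u16,
    since: Instant,
}

/// Hypothetical stager: a map of in-flight requests keyed by a local id.
#[derive(Default)]
struct Stager {
    requests: HashMap<u64, Pending>,
    next_id: u64,
}

impl Stager {
    /// Record a request that has reached the stager.
    fn stage(&mut self, msg_type: u16, since: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.requests.insert(id, Pending { msg_type, since });
        id
    }

    /// Forget a request once it has been answered.
    fn complete(&mut self, id: u64) {
        self.requests.remove(&id);
    }

    /// Sorted, de-duplicated types of blocking requests pending for at least
    /// `min_age`. A request that never reached `stage` cannot appear here.
    fn stuck_request_types(&self, now: Instant, min_age: Duration) -> Vec<u16> {
        let stuck: BTreeSet<u16> = self
            .requests
            .values()
            .filter(|p| BLOCKING_TYPES.contains(&p.msg_type))
            .filter(|p| now.duration_since(p.since) >= min_age)
            .map(|p| p.msg_type)
            .collect();
        stuck.into_iter().collect()
    }
}

fn main() {
    let t0 = Instant::now();
    let limit = Duration::from_secs(10);
    let mut stager = Stager::default();

    // Before the fix: the request is stuck upstream and is never staged.
    assert!(stager.stuck_request_types(t0 + limit, limit).is_empty());

    // After the fix: the request is staged and shows up once it is old enough.
    let id = stager.stage(143, t0);
    stager.stage(27, t0); // not a blocking type in this sketch
    assert_eq!(stager.stuck_request_types(t0 + limit, limit), vec![143]);

    stager.complete(id);
    assert!(stager.stuck_request_types(t0 + limit, limit).is_empty());
}
```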
Test
type_143_lockup_does_not_block_other_requests: two connections share a mock dispatcher where type 143 sleeps for 60 s. The test confirms that a type 27 request on the second connection receives its response in well under 5 seconds despite the concurrent blocked type 143.
Test plan
- cargo test -p gl-signerproxy passes (regression test runs in ~0.15 s)
- cargo check -p gl-signerproxy clean
- stuck_request_types: [143] and early-abort within 10 s when a node is stuck on an onchaind type 143 request