Bound waited example servers with timeout to fix flaky CI test hangs #10758

Open
julek-wolfssl wants to merge 3 commits into
wolfSSL:masterfrom
julek-wolfssl:fix-ocsp-stapling-tls13multi-wait-timeout

@julek-wolfssl

Member

Problem

Several test scripts background the example server and then wait for it to
exit with no timeout. When the server occasionally fails to exit (a timing
race seen under heavy parallel CI load), the script blocks until the CI job's
timeout-minutes limit is reached, which cancels the whole trackmemory run.
This was seen consistently on the all-wolfentropy config via
ocsp-stapling_tls13multi.test (cases 6 and 7).

Fix

Wrap the waited example servers in timeout -s KILL 2m (the same pattern
scripts/dtls.test already uses), so a stuck server is killed and the test
fails fast instead of hanging the entire job.
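
A minimal sketch of the pattern, for illustration only. It assumes GNU timeout is on PATH, uses `sleep 30` as a stand-in for an example server that never exits, and shortens the bound to one second so the demo finishes quickly (the PR itself uses 2m). The variable names are hypothetical and not taken from the scripts.

```shell
#!/bin/sh
# Stand-in for a stuck ./examples/server/server (assumption: sleep models the hang).
STUCK_SERVER="sleep 30"
# Demo bound in seconds; the PR uses "2m".
BOUND=1

# Background the bounded server, exactly as the tests background the real one.
# $STUCK_SERVER is left unquoted on purpose so it splits into command + args.
timeout -s KILL "$BOUND" $STUCK_SERVER &
SERVER_PID=$!

# Without the timeout wrapper this wait would block for the full 30 seconds.
wait "$SERVER_PID"
SERVER_RESULT=$?
echo "server wait returned $SERVER_RESULT"
```

With GNU timeout and `-s KILL`, an expired bound is reported as status 137 (128 + SIGKILL) rather than the usual 124, so a caller can tell a killed server apart from a clean exit.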

Scripts updated:

  • ocsp-stapling_tls13multi.test
  • ocsp-stapling.test
  • ocsp-stapling2.test
  • ocsp-stapling-with-wolfssl-responder.test
  • crl-revoked.test
  • tls13.test
  • resume.test
  • pkcallbacks.test
  • dtlscid.test

Test cases 6 and 7 background the example server and then "wait" for it
to exit. When the server occasionally fails to exit (a timing race under
heavy parallel CI load), the script blocks until the job's
timeout-minutes, cancelling the whole trackmemory run - seen
consistently on the all-wolfentropy config.

Wrap those two servers in "timeout -s KILL 2m" (as scripts/dtls.test
already does) so a stuck server is killed and the test fails fast instead
of timing out the whole job.

Several test scripts share the same pattern as ocsp-stapling_tls13multi:
a backgrounded example server is "wait"ed on with no timeout, so a
server that flakily fails to exit blocks the script until the CI job
timeout. Wrap those servers in "timeout -s KILL 2m" as well.

Scripts: ocsp-stapling, ocsp-stapling2,
ocsp-stapling-with-wolfssl-responder, crl-revoked, tls13, resume,
pkcallbacks, dtlscid.

Copilot AI review requested due to automatic review settings June 23, 2026 09:03
@julek-wolfssl julek-wolfssl self-assigned this Jun 23, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR aims to prevent CI hangs in several test scripts by bounding backgrounded example servers with a timeout, so that a stuck server process is force-killed and the test run fails quickly rather than blocking until the CI job timeout.

Changes:

  • Wrap multiple ./examples/server/server invocations with timeout -s KILL 2m across several .test scripts.
  • Apply the same timeout pattern to TLS 1.3 early-data server runs (piped to tee).
  • Add timeouts to OCSP stapling and CRL-revocation related server invocations that previously could cause indefinite wait hangs.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/tls13.test | Wrap TLS 1.3 early-data example server runs in a 2-minute SIGKILL timeout to avoid hangs. |
| scripts/resume.test | Wrap the resume-test example server in a 2-minute SIGKILL timeout. |
| scripts/pkcallbacks.test | Wrap the pkcallbacks example server in a 2-minute SIGKILL timeout. |
| scripts/ocsp-stapling2.test | Add 2-minute SIGKILL timeout around OCSP stapling servers in specific test cases. |
| scripts/ocsp-stapling.test | Add 2-minute SIGKILL timeout around the backgrounded OCSP interop server. |
| scripts/ocsp-stapling-with-wolfssl-responder.test | Add 2-minute SIGKILL timeout around many OCSP stapling server launches to prevent indefinite waits. |
| scripts/ocsp-stapling_tls13multi.test | Add 2-minute SIGKILL timeout around TLS 1.3 multi-stapling server launches to avoid hangs. |
| scripts/dtlscid.test | Add 2-minute SIGKILL timeout around the DTLS CID server launch. |
| scripts/crl-revoked.test | Add 2-minute SIGKILL timeout around CRL revoked-cert server launches (including hashdir test). |

Comments suppressed due to low confidence (1)

scripts/dtlscid.test:61

  • The server is now wrapped in timeout, so wait $SERVER_PID can return a non-zero exit status on timeout or server failure, but the script doesn’t check it. That can allow the test to pass even if the server hung and was killed by the timeout.

Check the wait result and fail the test when the server exits non-cleanly (including due to timeout).

    timeout -s KILL 2m $WOLFSSL_ROOT/examples/server/server -v4 -u --cid $SCID 1> $SERVER_FILE &
    SERVER_PID=$!
    sleep 0.2
    $WOLFSSL_ROOT/examples/client/client -v4 -u --cid $CCID 1> $CLIENT_FILE
    wait $SERVER_PID
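
One possible way to act on this, sketched under assumptions: `check_server_exit` is a hypothetical helper that does not exist in the wolfSSL scripts, `sleep 30` stands in for a hung server, and the bound is cut to one second for the demo. It relies on GNU timeout reporting 137 when the command is killed with `-s KILL`.

```shell
#!/bin/sh
# Hypothetical helper: map the status returned by `wait` to pass/fail.
check_server_exit() {
    status=$1
    case "$status" in
        0)
            return 0 ;;
        137)
            # 128 + 9: the server (or timeout itself) was ended by SIGKILL.
            echo "server was killed by timeout" >&2
            return 1 ;;
        *)
            echo "server exited with status $status" >&2
            return 1 ;;
    esac
}

# Stand-in for a hung example server, bounded to 1 second for the demo.
timeout -s KILL 1 sleep 30 &
SERVER_PID=$!
wait "$SERVER_PID"
SERVER_RESULT=$?

if check_server_exit "$SERVER_RESULT"; then
    RESULT=pass
else
    RESULT=fail
fi
echo "test result: $RESULT"
```

In a real test script the last step would more likely be `check_server_exit "$SERVER_RESULT" || exit 1`, so that a killed server fails the test instead of passing silently.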

Comment thread scripts/resume.test
Comment thread scripts/pkcallbacks.test
Comment thread scripts/crl-revoked.test
Comment thread scripts/crl-revoked.test
@julek-wolfssl julek-wolfssl marked this pull request as ready for review June 23, 2026 09:38
@github-actions

retest this please

timeout(1) is GNU coreutils and is not installed on macOS, so the
"make check macos" job failed with "timeout: command not found" for
every wrapped server. Add a small shim to each affected test: when
timeout is unavailable (e.g. macOS) run the server unbounded, restoring
the prior macOS behavior. The flaky hang the timeout guards against is on
the Linux-only trackmemory job, so macOS does not need the bound.
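
The shim itself is not shown in this conversation. A rough sketch of how such a fallback could look follows; `run_bounded` is a hypothetical name, and `sh -c 'exit 3'` stands in for the example server so the snippet runs anywhere.

```shell
#!/bin/sh
# Hypothetical shim: bound the command when timeout(1) exists,
# otherwise run it unbounded (the prior macOS behavior).
if command -v timeout >/dev/null 2>&1; then
    run_bounded() { timeout -s KILL 2m "$@"; }
else
    run_bounded() { "$@"; }
fi

# Stand-in for the example server: exits on its own with status 3.
run_bounded sh -c 'exit 3' &
SERVER_PID=$!
wait "$SERVER_PID"
SERVER_RESULT=$?
echo "server wait returned $SERVER_RESULT"
```

When the command finishes before the bound, timeout passes its exit status through, so the script sees the same result on both branches of the shim.
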
@julek-wolfssl

Member Author

retest this please flaky test
