Bound waited example servers with timeout to fix flaky CI test hangs #10758
Open
julek-wolfssl wants to merge 3 commits into
Test cases 6 and 7 background the example server and then "wait" for it to exit. When the server occasionally fails to exit (a timing race under heavy parallel CI load), the script blocks until the job's timeout-minutes, cancelling the whole trackmemory run - seen consistently on the all-wolfentropy config. Wrap those two servers in "timeout -s KILL 2m" (as scripts/dtls.test already does) so a stuck server is killed and the test fails fast instead of timing out the whole job.
Several test scripts share the same pattern as ocsp-stapling_tls13multi: a backgrounded example server is "wait"ed on with no timeout, so a server that flakily fails to exit blocks the script until the CI job timeout. Wrap those servers in "timeout -s KILL 2m" as well. Scripts: ocsp-stapling, ocsp-stapling2, ocsp-stapling-with-wolfssl-responder, crl-revoked, tls13, resume, pkcallbacks, dtlscid.
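The effect of the wrapper can be seen in isolation with a stand-in process. The sketch below is illustrative only: `sleep 30` plays the role of an example server that fails to exit, and the bound is shortened to one second so the demonstration finishes quickly. It assumes GNU coreutils `timeout` is on the PATH.

```shell
#!/bin/sh
# Stand-in for a server that never exits on its own (illustrative only).
# The real scripts use a 2m bound; 1 second keeps this demo short.
timeout -s KILL 1 sleep 30 &
server_pid=$!

# wait returns the exit status of the backgrounded timeout process,
# so the script unblocks as soon as the bound expires.
wait "$server_pid"
server_status=$?

# GNU coreutils documents 137 (128 + SIGKILL) when the command is sent KILL.
echo "server exit status: $server_status"
```

Without the wrapper, the same `wait` would block for the full 30 seconds, or indefinitely for a server that never exits.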
Pull request overview
This PR aims to prevent CI hangs in several test scripts by bounding backgrounded example servers with a timeout, so that a stuck server process is force-killed and the test run fails quickly rather than blocking until the CI job timeout.
Changes:
- Wrap multiple `./examples/server/server` invocations with `timeout -s KILL 2m` across several `.test` scripts.
- Apply the same timeout pattern to TLS 1.3 early-data server runs (piped to `tee`).
- Add timeouts to OCSP stapling and CRL-revocation related server invocations that previously could cause indefinite `wait` hangs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/tls13.test | Wrap TLS 1.3 early-data example server runs in a 2-minute SIGKILL timeout to avoid hangs. |
| scripts/resume.test | Wrap the resume-test example server in a 2-minute SIGKILL timeout. |
| scripts/pkcallbacks.test | Wrap the pkcallbacks example server in a 2-minute SIGKILL timeout. |
| scripts/ocsp-stapling2.test | Add 2-minute SIGKILL timeout around OCSP stapling servers in specific test cases. |
| scripts/ocsp-stapling.test | Add 2-minute SIGKILL timeout around the backgrounded OCSP interop server. |
| scripts/ocsp-stapling-with-wolfssl-responder.test | Add 2-minute SIGKILL timeout around many OCSP stapling server launches to prevent indefinite waits. |
| scripts/ocsp-stapling_tls13multi.test | Add 2-minute SIGKILL timeout around TLS 1.3 multi-stapling server launches to avoid hangs. |
| scripts/dtlscid.test | Add 2-minute SIGKILL timeout around the DTLS CID server launch. |
| scripts/crl-revoked.test | Add 2-minute SIGKILL timeout around CRL revoked-cert server launches (including hashdir test). |
Comments suppressed due to low confidence (1)
scripts/dtlscid.test:61
- The server is now wrapped in `timeout`, so `wait $SERVER_PID` can return a non-zero exit status on timeout or server failure, but the script doesn't check it. That can allow the test to pass even if the server hung and was killed by the timeout.
Check the wait result and fail the test when the server exits non-cleanly (including due to timeout).
timeout -s KILL 2m $WOLFSSL_ROOT/examples/server/server -v4 -u --cid $SCID 1> $SERVER_FILE &
SERVER_PID=$!
sleep 0.2
$WOLFSSL_ROOT/examples/client/client -v4 -u --cid $CCID 1> $CLIENT_FILE
wait $SERVER_PID
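One way to act on this suggestion is a small helper that inspects the status returned by `wait`. This is a sketch, not code from the PR: `check_server_exit` is a hypothetical name, and the two background commands are stand-ins for a server that exits cleanly and one that was killed (status 137).

```shell
#!/bin/sh
# Hypothetical helper: wait for the given PID and report failure if the
# process did not exit cleanly (non-zero status, including a timeout kill).
check_server_exit() {
    wait "$1"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "server exited with status $status" >&2
        return 1
    fi
    return 0
}

# Stand-in for a server that exits cleanly.
true &
if check_server_exit $!; then clean_result=pass; else clean_result=fail; fi

# Stand-in for a server killed by the timeout wrapper (status 137).
sh -c 'exit 137' &
if check_server_exit $! 2>/dev/null; then hung_result=pass; else hung_result=fail; fi

echo "clean: $clean_result, hung: $hung_result"
```

In a test script, a failing `check_server_exit "$SERVER_PID"` would typically be followed by cleanup and a non-zero `exit`.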
retest this please
timeout(1) is GNU coreutils and is not installed on macOS, so the "make check macos" job failed with "timeout: command not found" for every wrapped server. Add a small shim to each affected test: when timeout is unavailable (e.g. macOS) run the server unbounded, restoring the prior macOS behavior. The flaky hang the timeout guards against is on the Linux-only trackmemory job, so macOS does not need the bound.
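A shim of this kind might look like the sketch below. It is an assumption about the shape of the fallback, not the PR's actual code; `run_bounded` is a hypothetical name, and `echo` stands in for the example server.

```shell
#!/bin/sh
# Hypothetical shim: bound the command when timeout(1) exists,
# otherwise run it unbounded (e.g. on macOS without GNU coreutils).
if command -v timeout >/dev/null 2>&1; then
    run_bounded() { timeout -s KILL 2m "$@"; }
else
    run_bounded() { "$@"; }
fi

# Stand-in for launching the example server through the shim.
shim_output=$(run_bounded echo "server stand-in")
echo "$shim_output"
```

Either branch runs the command with its arguments unchanged, so call sites do not need to know whether the bound is in effect.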
retest this please flaky test
Problem
Several test scripts background the example server and then `wait` for it to exit with no timeout. When the server occasionally fails to exit (a timing race seen under heavy parallel CI load), the script blocks until the CI job's `timeout-minutes`, cancelling the whole `trackmemory` run. This was seen consistently on the `all-wolfentropy` config via `ocsp-stapling_tls13multi.test` (cases 6 and 7).
Fix
Wrap the waited example servers in `timeout -s KILL 2m` (the same pattern `scripts/dtls.test` already uses), so a stuck server is killed and the test fails fast instead of hanging the entire job.
Scripts updated:
- ocsp-stapling_tls13multi.test
- ocsp-stapling.test
- ocsp-stapling2.test
- ocsp-stapling-with-wolfssl-responder.test
- crl-revoked.test
- tls13.test
- resume.test
- pkcallbacks.test
- dtlscid.test