CLI: Stats command - fix incorrect CPU % reporting#40627
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes incorrect CPU% reporting in the wslc ... stats command by ensuring Docker’s precpu_stats is populated and by aligning the CLI-side CPU% calculation with Docker CLI’s algorithm.
Changes:
- Removed the
one-shot=truestats query parameter so Docker returnsprecpu_statsalongsidecpu_statswhenstream=false. - Updated CPU% calculation to (a) fall back to
percpu_usagelength whenonline_cpus == 0and (b) match Docker’s> 0delta checks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/windows/wslcsession/DockerHTTPClient.cpp | Adjusts Docker stats request parameters to ensure precpu_stats is populated for CPU% deltas. |
| src/windows/wslc/tasks/ContainerTasks.cpp | Aligns CPU% computation with Docker CLI, including online_cpus fallback behavior and delta checks. |
| const auto onlineCpus = stats.cpu_stats.online_cpus > 0 ? stats.cpu_stats.online_cpus : 1u; | ||
| if (systemDelta > 0.0 && cpuDelta >= 0.0) | ||
|
|
||
| // When online_cpus is 0 (older API responses), fall back to percpu_usage array length — matches Docker CLI behavior. |
There was a problem hiding this comment.
Is that something that can happen with our version of docker ? If not, I'd just recommend removing that block entirely since the API response we get should be consistent
|
|
||
| VERIFY_IS_GREATER_THAN_OR_EQUAL( | ||
| elapsedMs, c_minExpectedMs, L"Stats completed suspiciously fast - Docker sampling may not be occurring"); | ||
| VERIFY_IS_LESS_THAN( |
There was a problem hiding this comment.
Sadly this is one of the cases that's difficult to test in a reliable way. I don't recommend running this timing check in the CI, since it's likely to break if Defender / Windows Update triggers right as this test runs.
There was a problem hiding this comment.
OK I will remove this test. I have refactored some things with the latest update, and included a test for the batching async code independent of the stats calling it, so that should provide good coverage that the batching / async behavior is working as expected and we can avoid a potentially risky test like this one.
|
|
||
| // Build stats as a json array first for later filtering or display either as json or table format. | ||
| nlohmann::json statsJson = nlohmann::json::array(); | ||
| // Fetch stats for all containers concurrently. The Docker engine blocks for ~1s per |
There was a problem hiding this comment.
Hmm that's too bad that the request hangs for a second. Maybe we could implement a workaround in the service to issue N requests in parallel and return it to the client, but the current API doesn't support that so I think this is OK for now
There was a problem hiding this comment.
Latest iteration does batches of requests in parallel so we can handle the common case very rapidly but protect against extreme cases of very large container counts that might overly stress the endpoints. I put in a batch size of 10 by default but its configurable.
|
|
||
| #include <future> | ||
| #include <optional> |
| template <typename TItem, typename TWork, typename TSuccess, typename TError> | ||
| void ForEachAsync(const std::vector<TItem>& items, TWork onWork, TSuccess onSuccess, TError onError, size_t batchSize = 10) | ||
| { | ||
| WI_ASSERT(batchSize > 0); |
| struct BatchResult | ||
| { | ||
| TItem item; | ||
| std::optional<TResult> result; | ||
| wil::ResultException error{S_OK}; | ||
| bool hasError{false}; | ||
| }; |
| wsl::windows::wslc::ForEachAsync<std::wstring>( | ||
| containers, | ||
| // Work | ||
| [&session](const std::wstring& containerId) { | ||
| return ComputeContainerStatsJson(ContainerService::Stats(session, WideToMultiByte(containerId))); | ||
| }, |
| if (onlineCpus == 0) | ||
| { | ||
| onlineCpus = stats.cpu_stats.cpu_usage.percpu_usage.has_value() | ||
| ? static_cast<uint32_t>(stats.cpu_stats.cpu_usage.percpu_usage->size()) | ||
| : 1u; |
Summary of the Pull Request
The CPU calculation is incorrect due to precpu_stats not being populated from the docker request. This is due to the combination of "stream=false" and "one-shot=true" in the request parameters. In this combination, docker engine does not populate precpu_stats. The fix is to remove "one-shot=true" from the request, which causes docker engine to do two samples internally and then have valid population for precpu_stats.
Since docker does an internal sampling of ~1s, and docker does not support batch container requests (one container id per request), running stats on all active containers now has a performance issue of 1s * N containers. This fix also runs the stats requests for all containers in parallel to mitigate this performance issue, and adds a generous E2E test to verify the parallel processing and performance is as-expected and does not regress.
This fix also corrects some minor differences in the stats calculation in the CLI end for edge cases we will probably never hit but aligning it to be exact for the sake of being as correct as possible with the Docker CLI implementation.
PR Checklist
Detailed Description of the Pull Request / Additional comments
Three changes were made here:
The stats call in the docker request dropped "one-shot=true". This, when combined with "stream=false" caused docker to not populate precpu_stats. They were zeroed out (confirmed in debugger). Removing this line caused docker engine to do an internal 1s block to capture the prior sample. With the one-shot parameter removed, the precpu_stats were now populated (also confirmed in debugger).
The stats calculation had two adjustments to align it exactly with Docker's calculation:
The first change is the real fix to the CPU issue. The second is defensive alignment for edge cases that we should not in practice hit, but might someday encounter, and the third is a mitigation for the potential scaling issue the first change introduced.
Validation Steps Performed
New parallel stats command with 20 'sleep infinity' containers to verify the sampling time for a large number of containers is relatively close to ~1s. This timing is very generous given the test machine may be stressed. Currently it allows for up to 13s for it to run and be considered still running in parallel. If this test starts having random failures we can revisit the parameters or consider removing the test.
1 Ran stress tests with 1 cpu and 16 cpus:
wslc run --rm alpine sh -c "apk add -q stress-ng && stress-ng --cpu 16 --timeout 60"wslc run --rm alpine sh -c "apk add -q stress-ng && stress-ng --cpu 1 --timeout 60"2 Ran stats in another window to confirm values
Confirmed CPU% was around 100% for the --cpu 1 case and 1600% for the --cpu 16 case. This is correct with Docker's CPU% reporting as 100% per core. 1 fully loaded cpu should be 100%, and 16 should be around 1600%. Excess over these expectations is also expected and includes the process's own overhead and init process consuming additional CPU.
Also here is a manual run of 20 containers and the stats command clearly running far less than 20s the sequential command would have.
