
CLI: Stats command - fix incorrect CPU % reporting#40627

Open
dkbennett wants to merge 7 commits into master from user/dkbennett/statsfix

Conversation

@dkbennett
Member

@dkbennett dkbennett commented May 22, 2026

Summary of the Pull Request

CPU % was calculated incorrectly because precpu_stats was not populated in the Docker stats response. The cause is the combination of "stream=false" and "one-shot=true" in the request parameters: with both set, the Docker engine does not populate precpu_stats. The fix is to remove "one-shot=true" from the request, so that the Docker engine takes two samples internally and returns a populated precpu_stats.

Because Docker samples internally for ~1s and does not support batch container requests (one container ID per request), running stats on all active containers would now take roughly 1s × N containers. This fix therefore also runs the stats requests for all containers in parallel to mitigate that cost, and adds a generous E2E test to verify that the parallel processing performs as expected and does not regress.

This fix also corrects some minor differences in the CLI-side stats calculation. They affect edge cases we will probably never hit, but the calculation now matches the Docker CLI implementation exactly.

PR Checklist

  • Closes: Link to issue #xxx
  • Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
  • Tests: Added/updated if needed and all pass
  • Localization: All end user facing strings can be localized
  • Dev docs: Added/updated if needed
  • Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx

Detailed Description of the Pull Request / Additional comments

Three changes were made here:

  1. The stats request to Docker no longer sets "one-shot=true". Combined with "stream=false", that parameter caused Docker to leave precpu_stats unpopulated; the values were zeroed out (confirmed in the debugger). With it removed, the Docker engine blocks internally for ~1s to capture the prior sample, and precpu_stats is now populated (also confirmed in the debugger).

  2. The stats calculation had two adjustments to align it exactly with Docker's calculation:

     • "onlineCpus" now falls back to the size of percpu_usage, as Docker does, because older Docker APIs can report onlineCpus as 0. Our behavior differed in that case, so it was fixed. The debugger confirmed that the APIs we use do not actually report 0, so this is purely a defensive change to keep us from breaking if the API behavior changes in the future.
     • The system delta and CPU delta check used >= where Docker uses >. This should not have any meaningful effect either, but it is fixed for exact correctness with Docker's algorithm.

  3. Stats are requested per container, and the requests now run in parallel to account for the ~1s sampling time per container. With all requests running simultaneously, the actual runtime of the stats command should be within a few seconds, even with many containers. A test was added to verify this; it is intentionally generous to account for a potentially stressed test machine.

The first change is the real fix to the CPU issue. The second is defensive alignment for edge cases that we should not in practice hit, but might someday encounter, and the third is a mitigation for the potential scaling issue the first change introduced.
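The second change can be sketched roughly as follows. This is a simplified illustration, not the code in ContainerTasks.cpp: the `CpuStats` struct and both function names are hypothetical, and the final fallback to 1 when percpu_usage is absent follows the snippet quoted in the review rather than anything confirmed about the Docker CLI.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical mirror of the stats fields involved in the calculation.
struct CpuStats
{
    uint64_t totalUsage = 0;                          // cpu_usage.total_usage
    uint64_t systemUsage = 0;                         // system_cpu_usage
    uint32_t onlineCpus = 0;                          // online_cpus (may be 0 on older APIs)
    std::optional<std::vector<uint64_t>> percpuUsage; // cpu_usage.percpu_usage
};

// Use online_cpus when reported; otherwise fall back to the percpu_usage length,
// and to 1 if that array is missing too (assumption taken from the reviewed snippet).
uint32_t EffectiveOnlineCpus(const CpuStats& stats)
{
    if (stats.onlineCpus > 0)
    {
        return stats.onlineCpus;
    }
    return stats.percpuUsage.has_value() ? static_cast<uint32_t>(stats.percpuUsage->size()) : 1u;
}

// Both deltas must be strictly positive (> rather than >=) before a percentage is reported.
double ComputeCpuPercent(const CpuStats& previous, const CpuStats& current)
{
    const double cpuDelta = static_cast<double>(current.totalUsage) - static_cast<double>(previous.totalUsage);
    const double systemDelta = static_cast<double>(current.systemUsage) - static_cast<double>(previous.systemUsage);
    if (systemDelta > 0.0 && cpuDelta > 0.0)
    {
        return (cpuDelta / systemDelta) * EffectiveOnlineCpus(current) * 100.0;
    }
    return 0.0;
}
```

Since a CPU delta of exactly zero yields 0% under either comparison, switching from >= to > does not change the reported value in this sketch, which is consistent with the note that the change has no meaningful effect.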

Validation Steps Performed

Added a new parallel stats test that runs 20 'sleep infinity' containers to verify that the sampling time for a large number of containers stays relatively close to ~1s. The timing threshold is very generous because the test machine may be stressed: currently a run of up to 13s is still considered parallel. If this test starts failing randomly we can revisit the parameters or consider removing the test.

  1. Ran stress tests with 1 CPU and 16 CPUs:

     • wslc run --rm alpine sh -c "apk add -q stress-ng && stress-ng --cpu 16 --timeout 60"
     • wslc run --rm alpine sh -c "apk add -q stress-ng && stress-ng --cpu 1 --timeout 60"

  2. Ran stats in another window to confirm the values.

Confirmed CPU% was around 100% for the --cpu 1 case and around 1600% for the --cpu 16 case. This matches Docker's CPU% reporting, which counts 100% per core: 1 fully loaded CPU should read 100%, and 16 should read around 1600%. Readings slightly above these values are also expected, since they include the overhead of the process itself and the init process consuming additional CPU.

image

Also, here is a manual run with 20 containers, where the stats command clearly finishes in far less than the 20s a sequential run would have taken.
image

@dkbennett dkbennett requested a review from a team as a code owner May 22, 2026 18:00
Copilot AI review requested due to automatic review settings May 22, 2026 18:00
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes incorrect CPU% reporting in the wslc ... stats command by ensuring Docker’s precpu_stats is populated and by aligning the CLI-side CPU% calculation with Docker CLI’s algorithm.

Changes:

  • Removed the one-shot=true stats query parameter so Docker returns precpu_stats alongside cpu_stats when stream=false.
  • Updated CPU% calculation to (a) fall back to percpu_usage length when online_cpus == 0 and (b) match Docker’s > 0 delta checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/windows/wslcsession/DockerHTTPClient.cpp | Adjusts Docker stats request parameters to ensure precpu_stats is populated for CPU% deltas. |
| src/windows/wslc/tasks/ContainerTasks.cpp | Aligns CPU% computation with Docker CLI, including online_cpus fallback behavior and delta checks. |

Comment thread src/windows/wslcsession/DockerHTTPClient.cpp
const auto onlineCpus = stats.cpu_stats.online_cpus > 0 ? stats.cpu_stats.online_cpus : 1u;
if (systemDelta > 0.0 && cpuDelta >= 0.0)

// When online_cpus is 0 (older API responses), fall back to percpu_usage array length — matches Docker CLI behavior.
Collaborator


Is that something that can happen with our version of Docker? If not, I'd recommend removing that block entirely, since the API response we get should be consistent.


VERIFY_IS_GREATER_THAN_OR_EQUAL(
elapsedMs, c_minExpectedMs, L"Stats completed suspiciously fast - Docker sampling may not be occurring");
VERIFY_IS_LESS_THAN(
Collaborator


Sadly this is one of the cases that's difficult to test in a reliable way. I don't recommend running this timing check in the CI, since it's likely to break if Defender / Windows Update triggers right as this test runs.

Member Author


OK I will remove this test. I have refactored some things with the latest update, and included a test for the batching async code independent of the stats calling it, so that should provide good coverage that the batching / async behavior is working as expected and we can avoid a potentially risky test like this one.


// Build stats as a json array first for later filtering or display either as json or table format.
nlohmann::json statsJson = nlohmann::json::array();
// Fetch stats for all containers concurrently. The Docker engine blocks for ~1s per
Collaborator


Hmm that's too bad that the request hangs for a second. Maybe we could implement a workaround in the service to issue N requests in parallel and return it to the client, but the current API doesn't support that so I think this is OK for now

Member Author


The latest iteration runs batches of requests in parallel, so we handle the common case very rapidly but protect against extreme cases of very large container counts that might overly stress the endpoints. I put in a batch size of 10 by default, but it's configurable.
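For readers who want a picture of the batching idea, here is a minimal sketch built on std::async. It is not the PR's ForEachAsync: the name `RunInBatches` and its signature are hypothetical, it assumes `work` returns a non-void value and is safe to call from several threads at once, and it simply rethrows the first failure instead of routing it to an error callback.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <type_traits>
#include <vector>

// Hypothetical helper: runs work(item) for every item, at most batchSize at a time,
// and returns the results in the same order as the input.
template <typename TItem, typename TWork>
std::vector<std::invoke_result_t<TWork&, const TItem&>> RunInBatches(
    const std::vector<TItem>& items, TWork work, size_t batchSize = 10)
{
    using TResult = std::invoke_result_t<TWork&, const TItem&>;
    std::vector<TResult> results;
    results.reserve(items.size());
    batchSize = std::max<size_t>(batchSize, 1); // guard against a zero batch size

    for (size_t start = 0; start < items.size(); start += batchSize)
    {
        const size_t end = std::min(start + batchSize, items.size());

        // Launch one task per item in this batch; they run concurrently.
        std::vector<std::future<TResult>> pending;
        pending.reserve(end - start);
        for (size_t i = start; i < end; ++i)
        {
            pending.push_back(std::async(std::launch::async, [&work, &items, i] { return work(items[i]); }));
        }

        // Wait for the whole batch before starting the next one.
        // get() rethrows any exception thrown by the task.
        for (auto& task : pending)
        {
            results.push_back(task.get());
        }
    }
    return results;
}
```

If each call blocks for about 1s, N items with a batch size of B would take roughly ceil(N / B) seconds under this scheme, so 20 containers at the default of 10 would need about 2s instead of 20s.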

Copilot AI review requested due to automatic review settings May 27, 2026 21:12
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comment on lines +16 to +18

#include <future>
#include <optional>
template <typename TItem, typename TWork, typename TSuccess, typename TError>
void ForEachAsync(const std::vector<TItem>& items, TWork onWork, TSuccess onSuccess, TError onError, size_t batchSize = 10)
{
WI_ASSERT(batchSize > 0);
Comment on lines +40 to +46
struct BatchResult
{
TItem item;
std::optional<TResult> result;
wil::ResultException error{S_OK};
bool hasError{false};
};
Comment on lines +577 to +582
wsl::windows::wslc::ForEachAsync<std::wstring>(
containers,
// Work
[&session](const std::wstring& containerId) {
return ComputeContainerStatsJson(ContainerService::Stats(session, WideToMultiByte(containerId)));
},
Comment on lines +76 to +80
if (onlineCpus == 0)
{
onlineCpus = stats.cpu_stats.cpu_usage.percpu_usage.has_value()
? static_cast<uint32_t>(stats.cpu_stats.cpu_usage.percpu_usage->size())
: 1u;

3 participants