gh-144586: Use CPU-specific instructions for _Py_yield (AArch64 only) #149784
Open
dpdani wants to merge 3 commits into
This PR adds a `nano_delay` function to avoid relying on the OS scheduler when using `_Py_yield` to back off from a contended mutation. Only AArch64 code paths have been added.

The `_PyMutex_LockTimed` function was updated to use an exponential backoff, which improves acquisition throughput on highly contended locks.

This PR omits the `nano_delay` implementation based on the `wfet` instruction because it requires runtime dispatching: not all AArch64 CPUs implement this feature, so a compile-time macro check would not be sufficient. It can be added in a separate PR.

This change shows performance improvements on the `lockbench` benchmark, tested with the following parameters:

- `--work-inside 5 --work-outside 50 --num-locks 24 --acquisitions 3 --random-locks`
- `--work-inside 5 --work-outside 5`

The run on the Graviton 3 machine, which has a high core count, exhibited major scalability bottlenecks past a certain number of processors, and this was also reproduced on a number of other machines. This problem is also present on main, and I will open a separate issue for it in the future.
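
For readers unfamiliar with the technique, here is a minimal sketch of spinning with exponential backoff using a CPU-level hint instead of yielding to the OS scheduler. It is not the code in this PR: `cpu_relax`, `backoff_next`, `spin_lock_backoff`, `spin_unlock`, and the `BACKOFF_*` limits are hypothetical names and values chosen for illustration, and the use of `isb` on AArch64 (rather than `yield` or a timer-based delay) is an assumption about what a reasonable hint might be.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical limits, picked only for illustration. */
#define BACKOFF_MIN_SPINS 1u
#define BACKOFF_MAX_SPINS 1024u

/* One CPU-level "relax" hint. The instruction choice is an assumption:
   `isb` is a common spin-wait hint on AArch64, `pause` on x86. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    /* No known hint: fall back to a compiler barrier only. */
    atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Double the spin count, saturating at BACKOFF_MAX_SPINS. */
static inline unsigned backoff_next(unsigned spins)
{
    return spins >= BACKOFF_MAX_SPINS / 2 ? BACKOFF_MAX_SPINS : spins * 2;
}

/* Acquire a simple test-and-set lock, waiting longer after each failure. */
static void spin_lock_backoff(atomic_flag *lock)
{
    unsigned spins = BACKOFF_MIN_SPINS;
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        for (unsigned i = 0; i < spins; i++) {
            cpu_relax();
        }
        spins = backoff_next(spins);
    }
}

/* Release the lock so that a waiting thread can acquire it. */
static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A caller would declare `atomic_flag lock = ATOMIC_FLAG_INIT;` and wrap the critical section in `spin_lock_backoff(&lock)` and `spin_unlock(&lock)`. Capping the spin count bounds the worst-case delay before the next acquisition attempt; a real mutex would typically also fall back to parking the thread after some number of failed attempts.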