gh-144586: Use CPU-specific instructions for _Py_yield (AArch64 only)#149784

Open
dpdani wants to merge 3 commits into python:main from dpdani:gh/144586-yield-nano-delay-aarch64

@dpdani dpdani commented May 13, 2026

This PR adds a nano_delay function to avoid relying on the OS scheduler when using _Py_yield to back off from a contended mutation. Only AArch64 code paths have been added.

The _PyMutex_LockTimed function was updated to use an exponential backoff, which improves acquisition throughput on highly contended locks.
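For readers unfamiliar with the technique, here is a minimal sketch of a spin lock with exponential backoff. It does not reproduce the `_PyMutex_LockTimed` change: the names `cpu_relax`, `next_backoff`, `spin_lock_backoff`, and `spin_unlock`, as well as the cap of 1024 spins, are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical CPU-level pause; stands in for whatever _Py_yield does. */
static inline void cpu_relax(void)
{
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("isb" : : : "memory");
#elif (defined(__x86_64__) || defined(__i386__)) && \
      (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("pause" : : : "memory");
#else
    atomic_signal_fence(memory_order_seq_cst);  /* compiler barrier only */
#endif
}

/* Double the spin count, saturating at max_spins. */
static inline unsigned next_backoff(unsigned spins, unsigned max_spins)
{
    assert(max_spins > 0);
    return (spins >= max_spins / 2) ? max_spins : spins * 2;
}

/* Try to take the lock; wait longer after each failed attempt.
 * Returns true on success, false after max_attempts failures. */
static bool spin_lock_backoff(atomic_flag *lock, unsigned max_attempts)
{
    unsigned spins = 1;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
            return true;
        }
        for (unsigned i = 0; i < spins; i++) {
            cpu_relax();
        }
        spins = next_backoff(spins, 1024);  /* assumed cap */
    }
    return false;
}

static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Growing the wait between attempts reduces how often contending threads hit the same cache line, which is the usual reason backoff helps throughput under heavy contention.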

This PR omits a nano_delay implementation based on the wfet instruction because it would require runtime dispatch: not all AArch64 CPUs implement the instruction, so a compile-time macro check is not sufficient. A follow-up PR could add it.
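To make the runtime-dispatch point concrete, the sketch below shows one way such a check might look on Linux, using `getauxval(AT_HWCAP2)` and the `HWCAP2_WFXT` bit. This is an assumption about a possible follow-up, not code from this PR: `delay_strategy`, `cpu_has_wfxt`, and `pick_delay_strategy` are hypothetical names, and on other platforms or with older libc headers the sketch simply reports the feature as absent.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  /* getauxval, AT_HWCAP2; HWCAP2_WFXT on recent libcs */
#endif

/* Hypothetical strategy tags for a dispatching delay function. */
typedef enum { DELAY_PORTABLE, DELAY_WFET } delay_strategy;

/* Ask the kernel at run time whether this CPU implements WFET/WFIT.
 * A compiler macro only describes the build target, not the machine
 * the binary eventually runs on. */
static bool cpu_has_wfxt(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_WFXT)
    return (getauxval(AT_HWCAP2) & HWCAP2_WFXT) != 0;
#else
    return false;  /* unknown platform or old headers: assume absent */
#endif
}

/* Pick the delay strategy based on the run-time check. */
static delay_strategy pick_delay_strategy(void)
{
    return cpu_has_wfxt() ? DELAY_WFET : DELAY_PORTABLE;
}
```

A real implementation would likely cache the result once at startup rather than querying it on every call.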

This change shows performance improvements on the lockbench benchmark, tested with the following parameters:

  • low contention: --work-inside 5 --work-outside 50 --num-locks 24 --acquisitions 3 --random-locks
  • high contention: --work-inside 5 --work-outside 5

[Benchmark plots: graviton_3_high, graviton_3_low, m4_max_high, m4_max_low]

On the Graviton 3 machine, which has a high core count, scalability hit major bottlenecks past a certain number of processors; this was also reproduced on several other machines. The same problem exists on main, so I will open a separate issue for it.

@diegorusso diegorusso requested a review from colesbury May 14, 2026 08:44