gh-153044: Auto-possessify greedy repeats with a disjoint continuation #153048

Open
serhiy-storchaka wants to merge 5 commits into python:main from serhiy-storchaka:re-possessive
serhiy-storchaka (Member) commented Jul 4, 2026

Add an optimizer pass that turns a greedy repeat into a possessive one when this cannot change what the pattern matches: if the repeated atom and every character that could possibly follow the repeat are provably disjoint, backtracking into the repeat is always futile, so a+b is compiled as if it were a++b. PCRE2 performs the same optimization under the name "auto-possessification".

The analysis only ever fires when it can prove disjointness, so any imprecision can merely cost an optimization, never change a match result. The repeated body may be a single one-character atom or a rigid concatenation (only its leading atom matters); the follower analysis walks through group boundaries, alternations, nullable repeats, atomic groups and the \z/$ anchors. Disjointness is proved by an atom model for \d/\w/\s, by finite witness sets (case-folded under IGNORECASE), or by probing the engine's own category predicate through a new helper _sre.category_matches(), so every engine category is decided exactly: \p{Lu}+x possessifies, and a category and its complement are always disjoint, so \p{X}+\P{X} possessifies for every X. A fused set-operation charset is evaluated exactly, so [\w--\d]+\d and [a-z--b]+b possessify too.
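
As a rough illustration of the follower walk, here is a toy model in which each atom is a finite set of characters and may be nullable. The names `Atom`, `follower_chars` and `can_possessify` are hypothetical and do not come from the patch; the real pass works on the compiled pattern and handles many more constructs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    chars: frozenset        # characters this atom can match
    nullable: bool = False  # True if it may match the empty string (x?, x*)


def follower_chars(tail):
    """Collect every character that could come right after the repeat.

    Returns None when the whole tail can match empty; this toy model then
    gives up instead of reasoning about what lies beyond the tail.
    """
    seen = set()
    for atom in tail:
        seen |= atom.chars
        if not atom.nullable:
            return seen
    return None


def can_possessify(body, tail):
    """True only when disjointness is proved; any doubt means False."""
    followers = follower_chars(tail)
    return followers is not None and body.chars.isdisjoint(followers)


A = Atom(frozenset("a"))
B = Atom(frozenset("b"))
OPT_B = Atom(frozenset("b"), nullable=True)
C = Atom(frozenset("c"))

ok_simple = can_possessify(A, [B])            # a+b
ok_nullable = can_possessify(A, [OPT_B, C])   # a+b?c
bad_nullable = can_possessify(A, [OPT_B, A])  # a+b?a
bad_open_end = can_possessify(A, [OPT_B])     # a+b? (toy model gives up)
```

Returning `False` whenever the walk cannot finish mirrors the stated design: an imprecise answer loses an optimization but never changes a match.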

The main benefit is failing or backtracking-heavy matches whose repeat is followed by a character set or category, which the existing REPEAT_ONE literal fast path does not cover: measured 2.2–4.5× on such failing matches. Successful matches are unaffected (~1×). Where disjointness is provable, every give-back already fails at the first following atom, so the pass removes only a constant factor; it does not defuse catastrophic nested-repeat patterns such as (a+)+b, which are among the possible extensions.
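
The constant-factor point can be made concrete with a toy matcher that counts how often the follower is tested. This is a hand-written model, not the `_sre` engine, and `match_repeat_then_set` is a hypothetical name used only for this sketch.

```python
def match_repeat_then_set(text, atom, follower, possessive):
    """Match `atom+` followed by one character from `follower`, anchored at 0.

    Returns (matched, probes), where probes counts how many times the
    follower set was tested against the text.
    """
    run = 0
    while run < len(text) and text[run] in atom:
        run += 1
    if run == 0:
        return False, 0
    # Greedy retries every shorter length on failure;
    # possessive tries only the longest run.
    candidates = [run] if possessive else range(run, 0, -1)
    probes = 0
    for end in candidates:
        probes += 1
        if end < len(text) and text[end] in follower:
            return True, probes
    return False, probes


failing = "a" * 1000 + "x"
greedy_fail = match_repeat_then_set(failing, "a", "bc", possessive=False)
possessive_fail = match_repeat_then_set(failing, "a", "bc", possessive=True)
greedy_ok = match_repeat_then_set("aaab", "a", "bc", possessive=False)
possessive_ok = match_repeat_then_set("aaab", "a", "bc", possessive=True)
```

In this model the greedy failure costs one cheap probe per character of the run, on top of the scan that both variants perform, so the saving is linear work with a smaller constant rather than a change in complexity. The successful match costs a single probe either way.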

Validated by differential fuzzing against the unoptimized compilation (millions of checks accumulated across pattern families: random ASCII, IGNORECASE with non-ASCII fold groups, scoped flags, anchors, set operations, \p{...} categories, exhaustive small-alphabet brute force), plus full-range exactness checks of the category probes against the engine.

This is the first step described in the issue; alternation bodies with disjoint first characters, word-boundary and lookahead followers, and category-vs-category reasoning are possible extensions.

…nuation

Add an optimizer pass that turns a greedy repeat into a possessive one
when this cannot change what the pattern matches: if the repeated atom
and every character that could follow the repeat are disjoint,
backtracking into the repeat is always futile.  For example "a+b" is
compiled as if it were "a++b".  PCRE2 calls this "auto-possessification".

The analysis only proves disjointness, so any imprecision can merely
cost an optimization, never change a match result.  It speeds up
backtracking-heavy failures whose tail is a character set or category
(not covered by the REPEAT_ONE fast path) severalfold and defuses some
catastrophic backtracking.

Category atoms are decided by the engine itself through a new helper,
_sre.category_matches(), so every engine category is decided exactly:
\p{Lu}+x possessifies, and a category and its complement are always
disjoint, so \p{X}+\P{X} possessifies for every X.  Under IGNORECASE
only the fold-invariant categories keep their meaning, and the analysis
evaluates a fused set-operation charset exactly, so [\w--\d]+\d and
[a-z--b]+b possessify too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
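
The probing idea in the commit message above can be imitated from pure Python. `_sre.category_matches()` is new in this PR, so the sketch does not call it; `category_set` is a hypothetical stand-in built on `re.fullmatch`, restricted to a small code point range and to categories that `re` already supports.

```python
import re


def category_set(pattern, limit=0x500):
    """Code points below `limit` matched by a one-character pattern."""
    rx = re.compile(pattern)
    return {cp for cp in range(limit) if rx.fullmatch(chr(cp))}


digits, non_digits = category_set(r"\d"), category_set(r"\D")
words, spaces = category_set(r"\w"), category_set(r"\s")

# A category and its complement never share a code point.
complement_disjoint = digits.isdisjoint(non_digits)
# Two different categories can also be disjoint, which the probe shows directly.
words_spaces_disjoint = words.isdisjoint(spaces)
# Digits are word characters, so \d+\w must not be possessified.
digits_words_overlap = not digits.isdisjoint(words)
```
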
read-the-docs-community Bot commented Jul 4, 2026

Documentation build overview

📚 cpython-previews | 🛠️ Build #33444493 | 📁 Comparing 7868451 against main (639a552)

  🔍 Preview build  

2 files changed
± whatsnew/3.16.html
± whatsnew/changelog.html

serhiy-storchaka and others added 4 commits July 4, 2026 23:06
Unicode \w and ascii \W overlap, so a category and its complement are
only disjoint within one flag context; this holds because the walk
never compares atoms across a flag-scoping boundary.  Add tests pinning
that (?a:\w+)\W and \w+(?a:\W) are not possessified, while \w+\W and
(?a:\w+\W) are.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each empty branch alternative appends the continuation to the follower
set again, so patterns like (|)(|)(|)... made the analysis exponential
in time and memory (found by OSS-Fuzz).  Give up when the set exceeds
a small limit; a repeat with dozens of distinct followers is not going
to be proven disjoint anyway.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
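
The blow-up and the cap from the commit message above can be sketched on a toy pattern representation. Everything here is hypothetical: the list-based pattern model, the names, and the value of `MAX_FOLLOWERS` (the commit message only says "a small limit").

```python
MAX_FOLLOWERS = 32  # hypothetical value for the "small limit"


class TooManyFollowers(Exception):
    pass


def followers(items, limit=None):
    """List the first characters that can follow, one entry per path.

    `items` is a sequence whose elements are either a one-character string
    or an alternation, written as a list of branches (each itself a list).
    """
    found = []

    def walk(seq):
        if not seq:
            found.append(None)          # reached the end of the pattern
        elif isinstance(seq[0], str):
            found.append(seq[0])
        else:
            for branch in seq[0]:
                walk(branch + seq[1:])  # an empty branch re-walks the continuation
        if limit is not None and len(found) > limit:
            raise TooManyFollowers

    walk(list(items))
    return found


def bounded_followers(items):
    try:
        return followers(items, MAX_FOLLOWERS)
    except TooManyFollowers:
        return None  # give up: the repeat simply stays greedy


EMPTY_ALT = [[], []]  # models (|)
unbounded = followers([EMPTY_ALT] * 10 + ["x"])         # (|) ten times, then x
bounded = bounded_followers([EMPTY_ALT] * 10 + ["x"])
small = bounded_followers([EMPTY_ALT, "x"])
```

Ten empty alternations already yield 1024 follower entries without the cap; with it, the walk stops after a few dozen and the repeat is left unoptimized.
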
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The follower scan recurses once per following group, so thousands of
sequential groups overflowed the recursion limit (the parser handles
them iteratively).  Add an explicit depth cap -- relying on
RecursionError is fragile near the limit -- and keep catching it as a
backstop: every rewrite already applied is sound on its own, so the
rest of the pattern just stays unoptimized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
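
A minimal sketch of the two-layer guard from the commit message above: an explicit depth cap as the primary check, with `RecursionError` caught as a backstop. The pattern model (a list of strings standing in for sequential groups), the names, and the value of `MAX_DEPTH` are assumptions made for illustration.

```python
MAX_DEPTH = 100  # hypothetical cap, well below the interpreter's recursion limit


class DepthExceeded(Exception):
    pass


def first_char_after(groups, index=0, depth=0):
    """First character required at or after position `index`.

    An empty string models a group that matches empty, so the scan recurses
    into the next group. Returns None if every remaining group is empty.
    """
    if depth > MAX_DEPTH:
        raise DepthExceeded
    if index == len(groups):
        return None
    if groups[index]:
        return groups[index][0]
    return first_char_after(groups, index + 1, depth + 1)


def scan_or_skip(groups):
    """Return ("ok", char) or ("skipped", None); never propagate an error."""
    try:
        return "ok", first_char_after(groups)
    except (DepthExceeded, RecursionError):
        return "skipped", None


shallow = scan_or_skip([""] * 5 + ["x"])
deep = scan_or_skip([""] * 5000 + ["x"])
```

Skipping is safe in this scheme because a skipped scan only means one repeat is left greedy; nothing already rewritten depends on it.
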