Speed up uniqueItems validation with structural hashing #1482

Merged: Julian merged 4 commits into python-jsonschema:main from oscmarb:faster-uniqueitems on May 12, 2026
oscmarb (Contributor) commented May 11, 2026

Summary

  • Rewrite uniq to deduplicate via a set of structural keys compatible with equality checks, instead of using sorted(...) + adjacent comparison.
  • The previous strategy degraded to O(n²) brute force whenever the items could not be ordered (dicts, for example), so most real uniqueItems validations hit the slow path.
  • Unhashable elements (and NaN) still fall back to brute-force equality comparison, preserving correctness for edge cases.
  • Adds direct unit tests for uniq covering the bool/int distinction, structural sequence/mapping equality, NaN, and unhashable elements.
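The approach in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names `equal`, `structural_key` and `uniq_sketch` are hypothetical, and the real helpers in `jsonschema/_utils.py` may differ in detail. The sketch assumes JSON-like data, keeps `bool` and `int` apart, and sends NaN and unhashable values to a brute-force comparison.

```python
from collections.abc import Mapping, Sequence


def equal(one, two):
    """Equality that keeps bool and int apart and short-circuits on identity."""
    if one is two:
        return True
    if isinstance(one, str) or isinstance(two, str):
        return one == two
    if isinstance(one, Sequence) and isinstance(two, Sequence):
        return len(one) == len(two) and all(
            equal(a, b) for a, b in zip(one, two)
        )
    if isinstance(one, Mapping) and isinstance(two, Mapping):
        return one.keys() == two.keys() and all(
            equal(value, two[key]) for key, value in one.items()
        )
    if isinstance(one, bool) or isinstance(two, bool):
        return isinstance(one, bool) and isinstance(two, bool) and one == two
    return one == two


def structural_key(value):
    """Build a hashable key; raise TypeError if no reliable key exists."""
    if isinstance(value, str):
        return (str, value)
    if isinstance(value, bool):
        # Tagged separately so that True and 1 never share a key.
        return (bool, value)
    if isinstance(value, Mapping):
        items = ((key, structural_key(item)) for key, item in value.items())
        return (Mapping, frozenset(items))
    if isinstance(value, Sequence):
        return (Sequence, tuple(structural_key(item) for item in value))
    if value != value:
        # NaN is not equal to itself, so a set lookup cannot be trusted.
        raise TypeError("NaN has no reliable structural key")
    hash(value)  # raises TypeError for unhashable leaves
    return (None, value)


def uniq_sketch(container):
    """Return True if no two elements of container are equal."""
    items = list(container)
    seen = set()
    fallback = []  # indices of elements that need brute-force comparison
    for index, item in enumerate(items):
        try:
            key = structural_key(item)
        except TypeError:
            fallback.append(index)
            continue
        if key in seen:
            return False
        seen.add(key)
    # Only the elements without a key pay the quadratic cost.
    for index in fallback:
        for other_index, other in enumerate(items):
            if index != other_index and equal(items[index], other):
                return False
    return True
```

With this sketch, `uniq_sketch([{"a": 1}, {"a": 1}])` is `False` and `uniq_sketch([1, True])` is `True`; only elements that cannot be keyed are compared pairwise.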

Performance

In my case, I was validating a 100 MB JSON file with a custom schema where almost all entries were checked for uniqueness via O(n²) brute force, because the values were dicts and could not be sorted. With the check now running in linear time, validation of that file is more than 17x faster (530 s down to 30 s).

| JSON size | Before | After |
|-----------|--------|-------|
| 20 MB     | 57 s   | 9 s   |
| 100 MB    | 530 s  | 30 s  |
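To illustrate why dict-valued arrays ended up on the slow path, here is a hypothetical reconstruction of a sort-then-compare strategy. It is a sketch under the assumption that the fallback is a plain pairwise comparison; `uniq_by_sorting` is an illustrative name, not the library's previous implementation. It also returns which path was taken, purely for demonstration.

```python
import itertools


def uniq_by_sorting(container):
    """Return (is_unique, path_taken) for a sort-then-compare check."""
    items = list(container)
    try:
        ordered = sorted(items)
    except TypeError:
        # dicts define no ordering, so sorting fails and every pair
        # must be compared: n * (n - 1) / 2 comparisons.
        unique = all(a != b for a, b in itertools.combinations(items, 2))
        return unique, "brute force"
    # Sortable input: duplicates end up next to each other.
    unique = all(a != b for a, b in zip(ordered, ordered[1:]))
    return unique, "sorted"
```

Calling `uniq_by_sorting([{"a": 1}, {"b": 2}])` takes the brute-force branch because `sorted` raises `TypeError` for dicts, whereas a list of plain integers takes the sorted branch.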

oscmarb and others added 2 commits May 11, 2026 10:27
Replace uniq's sort-then-compare strategy (which fell back to O(n^2) brute force) with an O(n) pass that builds an `equal`-compatible hashable key per element and dedupes via a set. Unhashable elements still fall back to brute force comparison.
Comment thread jsonschema/_utils.py Outdated
@@ -1,7 +1,14 @@
from collections.abc import Mapping, MutableMapping, Sequence
from operator import ne
Member

I don't see a reason to use ne here over just foo != bar.

And can you also revert the import change for re, it just adds noise to the diff.

Contributor Author

I was trying to avoid the noqa, as the linter was failing when comparing an object with itself. I changed it anyway and reverted the re import change too.
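For context, the two spellings discussed in this thread behave identically at runtime; the difference is only in how a linter sees them. The sketch below assumes the self-comparison is used to detect NaN-like values and that the linter rule involved is a "comparison with itself" check such as Ruff's PLR0124; both are assumptions, and the function names are hypothetical.

```python
from operator import ne


def self_unequal_via_operator(value):
    # A function call hides the self-comparison from the linter.
    return ne(value, value)


def self_unequal_via_syntax(value):
    # Same result; a self-comparison rule may flag this line,
    # hence the suppression comment.
    return value != value  # noqa: PLR0124
```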

        self.assertTrue(uniq([Unhashable(1), Unhashable(2)]))

    def test_nan_is_not_uniquely_hashable(self):
        self.assertFalse(uniq([nan, nan]))
Member

This seems slightly misleading, the test is using the same identical nan instance. Probably worth comparing 2 different nans as well.

Contributor Author

Renamed to test_nan_falls_back and added the distinct-instances case. Did the same for the sequence/mapping variants.

Contributor Author

Had to change float("nan") to -nan. On Python 3.11 (both CPython and PyPy) float("nan") returns the math.nan singleton, so nan is float("nan") is True and equal short-circuits on identity. -nan was the only way I found to reliably get a distinct NaN instance across all supported versions.

Member

Hah fun, yeah that seems ok.
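The identity-versus-equality distinction in this thread can be reproduced with a small sketch. `same_or_equal` is a hypothetical stand-in for an equality helper that short-circuits on identity; it is not the library's `equal`. Whether `float("nan")` returns a shared object depends on the interpreter, so the sketch only relies on negation producing a separate NaN.

```python
import math
from math import nan

# Negating NaN yields another NaN that is a different object from `nan`.
distinct_nan = -nan


def same_or_equal(one, two):
    # Identity is checked first, so an object always matches itself,
    # even when it is NaN.
    return one is two or one == two
```

Here `same_or_equal(nan, nan)` is `True` because both arguments are the same object, while `same_or_equal(nan, distinct_nan)` is `False` because NaN never compares equal to another NaN.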

Julian (Member) commented May 12, 2026

Thanks, seems reasonable overall, left a few minor comments.

Julian (Member) commented May 12, 2026

Thanks! Nice work.

Julian merged commit 9f6fc68 into python-jsonschema:main on May 12, 2026
87 checks passed
2 participants