Speed up uniqueItems validation with structural hashing #1482
Replace `uniq`'s sort-then-compare strategy (which fell back to O(n^2) brute force whenever the elements could not be sorted) with an O(n) pass that builds an `equal`-compatible hashable key per element and dedupes via a set. Unhashable elements still fall back to brute-force comparison.
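The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `structural_key` and `all_unique` are hypothetical names, plain `==` plus an identity check stands in for the library's `equal`, and the sketch assumes a value that cannot be keyed is never equal to one that can.

```python
from collections.abc import Mapping, Sequence


def structural_key(value):
    """Build a hashable key so that structurally equal values share a key.

    Hypothetical helper for illustration. Raises TypeError for values
    it cannot key (NaN, arbitrary objects), which triggers the fallback.
    """
    if isinstance(value, bool):
        # Tag booleans separately so True and 1 do not collide.
        return ("bool", value)
    if isinstance(value, (int, float)):
        if value != value:  # NaN is the only number unequal to itself
            raise TypeError("NaN has no stable key")
        return ("number", value)  # 1 and 1.0 hash and compare equal
    if isinstance(value, str):
        return ("str", value)
    if value is None:
        return ("null",)
    if isinstance(value, Mapping):
        # Key order must not matter, so use a frozenset of pairs.
        return ("object", frozenset((k, structural_key(v)) for k, v in value.items()))
    if isinstance(value, Sequence):
        return ("array", tuple(structural_key(v) for v in value))
    raise TypeError(f"cannot build a key for {type(value).__name__}")


def all_unique(items):
    """Return True if no two items are equal; linear for keyable items."""
    seen = set()
    leftovers = []  # items we could not key
    for item in items:
        try:
            key = structural_key(item)
        except TypeError:
            leftovers.append(item)
            continue
        if key in seen:
            return False
        seen.add(key)
    # Brute-force comparison, but only among the unkeyable items.
    for i, a in enumerate(leftovers):
        for b in leftovers[i + 1:]:
            if a is b or a == b:
                return False
    return True
```

With this shape, the quadratic cost is paid only by the elements that cannot be keyed, so a list of plain dicts, lists, strings, and numbers is handled in a single pass.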
@@ -1,7 +1,14 @@
from collections.abc import Mapping, MutableMapping, Sequence
from operator import ne
I don't see a reason to use `ne` here over just `foo != bar`.
Also, can you revert the import change for `re`? It just adds noise to the diff.
I was trying to avoid the `noqa`, as the linter was failing when comparing an object with itself. I changed it anyway and reverted the `re` import too.
        self.assertTrue(uniq([Unhashable(1), Unhashable(2)]))

    def test_nan_is_not_uniquely_hashable(self):
        self.assertFalse(uniq([nan, nan]))
This seems slightly misleading: the test uses the same `nan` instance twice. It's probably worth comparing two different NaNs as well.
Renamed to `test_nan_falls_back` and added the distinct-instances case. Did the same for the sequence/mapping variants.
Had to change `float("nan")` to `-nan`. On Python 3.11 (both CPython and PyPy) `float("nan")` returns the `math.nan` singleton, so `nan is float("nan")` is `True` and `equal` short-circuits on identity. `-nan` was the only way I found to reliably get a distinct NaN instance across all supported versions.
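For anyone following the thread, here is a small sketch of why the instance matters. It assumes CPython semantics for `is`, and `identity_or_equal` is a hypothetical stand-in for a comparison that short-circuits on identity, not the library's `equal`.

```python
import math

nan = math.nan
other_nan = -nan  # unary minus builds a new float object, still NaN


def identity_or_equal(a, b):
    # Hypothetical stand-in: treat an object as equal to itself,
    # otherwise defer to ==.
    return a is b or a == b


# NaN never compares equal to itself with ==.
plain_eq = nan == nan                                   # False

# An identity shortcut makes the *same* instance look like a duplicate...
same_instance = identity_or_equal(nan, nan)             # True

# ...but two distinct NaN objects are still considered different.
distinct_instances = identity_or_equal(nan, other_nan)  # False

# Python's own list comparison applies the same identity shortcut.
lists_same = [nan] == [nan]                             # True
lists_distinct = [nan] == [other_nan]                   # False
```

This is why a test that passes the same `nan` object twice exercises a different path from one that passes two separately created NaN values.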
Thanks, seems reasonable overall, left a few minor comments.

Thanks! Nice work.
Summary
- Rewrite `uniq` to deduplicate via a set of structural keys compatible with equality checks, instead of using `sorted(...)` + adjacent comparison.
- … `uniqueItems` validations hit the slow path.
- Add tests for `uniq` covering the bool/int distinction, structural sequence/mapping equality, NaN, and unhashable elements.

Performance
In my case, I was validating a 100 MB JSON file with a custom schema where almost all entries were checked for uniqueness via O(n²) brute force, as values were dicts and could not be sorted. By fixing this to run in linear time, the validation is now >17x faster.
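To illustrate the slow path described here, the following sketch shows that a list of dicts cannot be sorted and counts the pairwise comparisons a brute-force check needs. `brute_force_unique` is a hypothetical helper written for this example, not the library's code.

```python
records = [{"id": 2}, {"id": 1}, {"id": 2}]

# Dicts do not define an ordering, so sorting a list of them fails.
try:
    sorted(records)
    sortable = True
except TypeError:
    sortable = False


def brute_force_unique(items):
    """Compare every pair; return (is_unique, number_of_comparisons)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return False, comparisons
    return True, comparisons


# With no duplicates, n items need n * (n - 1) / 2 comparisons.
unique_result = brute_force_unique([{"id": i} for i in range(100)])
```

For 100 distinct dicts the pairwise check already performs 4950 comparisons, and that count grows quadratically with the array length.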