Add fastcdc chunker (keyed Gear hash) #9824
Merged
Add a new "fastcdc" content-defined chunker selectable via --chunker-params.
It uses the FastCDC Gear rolling hash (fp = (fp << 1) + Gear[byte]), which is
window-less and cheaper per byte than buzhash's cyclic-polynomial update, so it
chunks noticeably faster (see "borg benchmark cpu" output), while matching
buzhash64's chunk-size distribution and deduplication within noise.
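A minimal Python sketch of the Gear update described above, for illustration only. The table here is a hypothetical stand-in filled from a seeded PRNG (not borg's keyed derivation), and the function names are made up for this example; the 64-bit fingerprint width is an assumption.

```python
import random

MASK64 = (1 << 64) - 1  # assumption: the fingerprint is a 64-bit unsigned value


def make_gear_table(seed: int) -> list[int]:
    """Hypothetical stand-in table: 256 pseudo-random 64-bit values."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(256)]


def gear_update(fp: int, byte: int, gear: list[int]) -> int:
    """One Gear step: fp = (fp << 1) + Gear[byte], truncated to 64 bits."""
    return ((fp << 1) + gear[byte]) & MASK64


def gear_hash(data: bytes, gear: list[int]) -> int:
    """Feed all bytes of data through the Gear update, starting from 0."""
    fp = 0
    for b in data:
        fp = gear_update(fp, b, gear)
    return fp
```

No explicit window is needed because every update shifts older contributions one bit to the left: with a 64-bit fingerprint, a byte's table value has been shifted out completely after 64 further updates, so the fingerprint only depends on the last 64 bytes seen.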
The Gear table is keyed: it is derived from the repo id key via a CSPRNG (with
its own "fastcdc" domain), exactly like the buzhash64 table, so chunk cut points stay
unpredictable without the key (anti-fingerprinting). It implements the same
FastCDC techniques as buzhash64 (sub-minimum skipping, normalized chunking with
a required nc_level, min/max clamping); the mask uses the high bits of the hash
(Gear accumulates entropy there).
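A rough sketch of how these pieces could fit together, not the code from this PR. Assumptions: SHAKE-256 stands in for the CSPRNG, the domain separation is a simple prefix, the target size is 2^chunk_mask, and nc_level widens the mask before the target size and narrows it afterwards (as in the FastCDC paper). All function names are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1


def keyed_gear_table(key: bytes, domain: bytes = b"fastcdc") -> list[int]:
    """256 key-dependent 64-bit values (SHAKE-256 is an assumed stand-in CSPRNG)."""
    stream = hashlib.shake_256(domain + b"\0" + key).digest(256 * 8)
    return [int.from_bytes(stream[i * 8:(i + 1) * 8], "little") for i in range(256)]


def high_bits_mask(bits: int) -> int:
    """Mask selecting the top `bits` bits of a 64-bit fingerprint."""
    return ((1 << bits) - 1) << (64 - bits)


def find_cut(data: bytes, gear: list[int], chunk_min: int, chunk_max: int,
             chunk_mask: int, nc_level: int) -> int:
    """Return the length of the first chunk of data."""
    min_size, max_size, target = 1 << chunk_min, 1 << chunk_max, 1 << chunk_mask
    if len(data) <= min_size:
        return len(data)
    end = min(len(data), max_size)                       # max clamping
    mask_hard = high_bits_mask(chunk_mask + nc_level)    # before target: cut less likely
    mask_easy = high_bits_mask(chunk_mask - nc_level)    # after target: cut more likely
    fp = 0
    for i in range(min_size, end):                       # sub-minimum skipping
        fp = ((fp << 1) + gear[data[i]]) & MASK64
        mask = mask_hard if i < target else mask_easy
        if fp & mask == 0:
            return i + 1
    return end
```

Because the table depends on the key, two repositories with different keys cut the same data at different offsets, which is the anti-fingerprinting property mentioned above.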
chunker-params: "fastcdc,chunk_min,chunk_max,chunk_mask,nc_level" - there is no
window field, because Gear is window-less. Example: fastcdc,19,23,21,2
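To make the example spec concrete, here is a small helper that decodes it into byte sizes. It is illustrative only: the function name is hypothetical, and reading chunk_mask as "target size is about 2^chunk_mask" is an assumption (it is consistent with the 2 MiB target quoted for the benchmarks).

```python
def describe_fastcdc_params(spec: str) -> dict:
    """Decode 'fastcdc,chunk_min,chunk_max,chunk_mask,nc_level' into sizes in bytes."""
    name, chunk_min, chunk_max, chunk_mask, nc_level = spec.split(",")
    if name != "fastcdc":
        raise ValueError("not a fastcdc spec: %r" % spec)
    return {
        "min_size": 1 << int(chunk_min),      # 2^chunk_min
        "max_size": 1 << int(chunk_max),      # 2^chunk_max
        "target_size": 1 << int(chunk_mask),  # assumed: about 2^chunk_mask
        "nc_level": int(nc_level),
    }
```

For fastcdc,19,23,21,2 this gives a 512 KiB minimum, an 8 MiB maximum and a 2 MiB target.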
Also: borg benchmark cpu now measures the fastcdc chunker; tests in
borg.testsuite.chunkers (golden vector, size distribution, keyed gear table,
param parsing, slow fuzz); docs and changelog.
Benchmarks (scripts/chunker_bench.py, buzhash64 vs fastcdc, both nc_level=2,
incompressible data unless noted):
5 GiB, 2 MiB target (default params):
buzhash64: CV 0.294, 1011 MB/s
fastcdc: CV 0.295, 1313 MB/s (+30%)
64 MiB, 64 KiB target:
buzhash64: CV 0.374, shift-resilience 0.9928, 963 MB/s
fastcdc: CV 0.359, shift-resilience 0.9929, 1331 MB/s (+38%)
Re-backup of a 2.5 GiB file after scattered single-byte edits (dedup ratio,
0.5 = v2 fully deduplicated, lower is better):
64 edits: buzhash64 0.5237, fastcdc 0.5236
320 edits: buzhash64 0.6133, fastcdc 0.6161
borg benchmark cpu, 1 GB: fastcdc 3.80s, buzhash 4.36s, buzhash64 8.13s,
fixed 0.56s.
Chunk-size distribution, deduplication and shift-resilience match buzhash64
within noise; fastcdc is consistently faster.
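For readers who want to reproduce the two metrics, a sketch of how they could be computed. This is not scripts/chunker_bench.py: it assumes CV means the coefficient of variation of chunk sizes (standard deviation divided by mean) and that the dedup ratio is unique chunk bytes divided by the total bytes of both versions, which matches "0.5 = v2 fully deduplicated". Function names are hypothetical.

```python
import hashlib
import statistics


def coefficient_of_variation(sizes) -> float:
    """Population standard deviation of chunk sizes divided by their mean."""
    return statistics.pstdev(sizes) / statistics.fmean(sizes)


def dedup_ratio(chunks_v1, chunks_v2) -> float:
    """Unique chunk bytes / total bytes over both versions (lower is better)."""
    all_chunks = list(chunks_v1) + list(chunks_v2)
    total = sum(len(c) for c in all_chunks)
    unique = {}
    for c in all_chunks:
        unique[hashlib.sha256(c).digest()] = len(c)  # identical chunks are stored once
    return sum(unique.values()) / total
```

If v2 is chunked exactly like v1, every v2 chunk is already stored and the ratio is 0.5; each chunk changed by an edit pushes it above that.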
Also: fix a bug in the mask computation: use 1ULL instead of 1, so the shift
is done in a uint64, not in a 32-bit int.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
For buzhash64 and fastcdc, chunker params like chunk_min == chunk_max passed CLI
validation, but then failed with an AssertionError when the chunker was
constructed (buzhash64 needs window_size + 2^chunk_min + 1 <= 2^chunk_max,
fastcdc needs 2^chunk_min + 1 <= 2^chunk_max). Check this at parse time and
raise a proper ArgumentTypeError instead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mode
A wrong field count for a named algo (e.g. fastcdc,19,23,21 or
buzhash,19,23,21 with a forgotten field) used to fall through to the
old-style buzhash compat branch, which then crashed with a ValueError
from int('fastcdc') instead of raising a proper ArgumentTypeError.
- restrict the old-style compat branch to what it is for: four all-numeric
fields without an algorithm name (19,23,21,4095 still works).
- for buzhash64 and fastcdc, a wrong field count now raises an
ArgumentTypeError that spells out the expected format, which helps since
buzhash64 recently gained nc_level and fastcdc has no window field.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
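A sketch of what parse-time validation along these lines could look like for the fastcdc case. This is not borg's actual parser: the function name and error messages are hypothetical, and only argparse.ArgumentTypeError is a real API.

```python
import argparse

FASTCDC_FORMAT = "fastcdc,chunk_min,chunk_max,chunk_mask,nc_level"


def parse_fastcdc_params(s: str):
    """Parse and validate a fastcdc chunker-params string (illustrative only)."""
    fields = [f.strip() for f in s.split(",")]
    if fields[0].lower() != "fastcdc":
        raise argparse.ArgumentTypeError("not a fastcdc spec: %r" % s)
    if len(fields) != 5:
        # wrong field count: spell out the expected format instead of falling through
        raise argparse.ArgumentTypeError("expected %s, got %r" % (FASTCDC_FORMAT, s))
    try:
        chunk_min, chunk_max, chunk_mask, nc_level = (int(f) for f in fields[1:])
    except ValueError:
        raise argparse.ArgumentTypeError("fields after the name must be integers: %r" % s)
    if chunk_min < 0 or chunk_max < 0:
        raise argparse.ArgumentTypeError("chunk_min and chunk_max must be >= 0")
    # fastcdc constraint from the commit message: 2^chunk_min + 1 <= 2^chunk_max
    if (1 << chunk_min) + 1 > (1 << chunk_max):
        raise argparse.ArgumentTypeError("need 2^chunk_min + 1 <= 2^chunk_max")
    return "fastcdc", chunk_min, chunk_max, chunk_mask, nc_level
```

Raising ArgumentTypeError from a type= callable lets argparse report a normal usage error instead of a traceback.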
fill() is declared 'except 0' and always returns 1: errors propagate as
exceptions, so the 'if not self.fill(): return None' branches were unreachable
(and process() returning None would have crashed __next__ anyway). Call fill()
plainly instead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New fastcdc chunker (keyed Gear hash)
A FastCDC content-defined chunker using the window-less Gear rolling hash
(fp = (fp << 1) + Gear[byte]), which is cheaper per byte than buzhash's
cyclic-polynomial update, so it chunks noticeably faster while producing the same
chunk-size distribution and deduplication.
The Gear table is keyed: derived from the repo id key via CSPRNG (own
fastcdc domain), exactly like the buzhash64 table, so chunk cut points stay
unpredictable without the key (anti-fingerprinting). It implements the same
FastCDC techniques as buzhash64 (sub-minimum skipping, normalized chunking with a
required nc_level, min/max clamping); the mask uses the high bits of the hash.
chunker-params: fastcdc,chunk_min,chunk_max,chunk_mask,nc_level - no window
field, because Gear is window-less. E.g. fastcdc,19,23,21,2.
borg benchmark cpu now measures the fastcdc chunker; tests live in
borg.testsuite.chunkers (golden vector, size distribution, keyed gear table,
param parsing, slow fuzz); docs and changelog updated.
Benchmarks
scripts/chunker_bench.py, buzhash64 vs fastcdc, both nc_level=2, incompressible
data unless noted:
borg benchmark cpu, 1 GB: fastcdc 3.80s, buzhash 4.36s, buzhash64 8.13s, fixed 0.56s.
Chunk-size distribution, deduplication and shift-resilience match buzhash64 within
noise; fastcdc is consistently faster.
🤖 Generated with Claude Code