Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772
Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772xiangfu0 wants to merge 2 commits into
Conversation
V6 currently flushes a chunk only when the next entry would overflow the chunkSize-byte buffer. This adds an optional targetDocsPerChunk parameter so a chunk can also be bounded by document count, letting callers control chunk granularity independently of the byte budget. The cap defaults to -1 (DISABLE_DOCS_PER_CHUNK), preserving the existing purely byte-driven behavior; the on-disk format and version (6) are unchanged. The buffer-overflow flush predicate is extracted into a protected shouldStartNewChunk() hook in V4 (mirroring the existing writeChunkHeader hook) so V6 adds the docs cap without duplicating putBytes(). A unit test verifies each capped chunk holds exactly targetDocsPerChunk docs and round-trips. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18772 +/- ##
============================================
+ Coverage 64.79% 64.80% +0.01%
Complexity 1309 1309
============================================
Files 3380 3386 +6
Lines 209553 210148 +595
Branches 32799 32925 +126
============================================
+ Hits 135773 136191 +418
- Misses 62851 62992 +141
- Partials 10929 10965 +36
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds an optional document-count cap for chunk flushing in VarByteChunkForwardIndexWriterV6, allowing callers to control chunk granularity by both byte budget (chunkSize) and a new targetDocsPerChunk parameter, while keeping the on-disk format/version unchanged.
Changes:
- Introduces a protected
shouldStartNewChunk(int sizeRequired)hook (and a helper for current-chunk doc count) in V4 so subclasses can add extra chunk-flush predicates without duplicatingputBytes(). - Extends V6 with a 4-arg constructor and a docs-per-chunk cap that triggers chunk flushes once the current chunk reaches the configured doc count.
- Adds a V6 unit test intended to validate docs-per-chunk behavior and round-trip correctness.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java | Extracts the “start new chunk” predicate into an overridable hook and exposes current-chunk doc count for subclasses. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java | Adds optional targetDocsPerChunk support via shouldStartNewChunk override and a new constructor overload. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java | Adds a test for docs-per-chunk capping and value round-tripping. |
…counts - Fail fast in the constructor when targetDocsPerChunk is neither -1 (disabled) nor positive, instead of silently treating 0/other negatives as disabled. - Strengthen the test to verify per-chunk document counts (not just the chunk count) by reading each chunk's first docId from the metadata and asserting it equals chunk * targetDocsPerChunk; use 1000 docs with a cap of 30 so the last chunk holds a remainder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
VarByteChunkForwardIndexWriterV6currently closes a chunk only when the nextentry would overflow the
chunkSize-byte buffer. This PR adds an optionaltargetDocsPerChunkparameter so a chunk can additionally be bounded by documentcount, letting callers control chunk granularity independently of the byte budget.
VarByteChunkForwardIndexWriterV6(file, compressionType, chunkSize, targetDocsPerChunk). The existing 3-arg constructor delegates withtargetDocsPerChunk = -1(DISABLE_DOCS_PER_CHUNK).targetDocsPerChunk > 0, a chunk is flushed once it holds that many docs,even if the byte buffer isn't full; otherwise behavior is unchanged.
shouldStartNewChunk(int)hook inVarByteChunkForwardIndexWriterV4(mirroringthe existing
writeChunkHeaderhook), so V6 adds the cap without duplicatingputBytes().Motivation
For raw string/bytes columns, the ZSTD compression ratio depends heavily on how
many repeated values fall within a single chunk (the dedup window). Being able to
bound a chunk by document count — not only by bytes — gives finer control over the
size/granularity tradeoff for repetitive columns.
Constructor examples
Configuring V6 with a chunk size (table config)
Select this writer and the chunk size per column via
fieldConfigList:{ "fieldConfigList": [ { "name": "myColumn", "encodingType": "RAW", "indexes": { "forward": { "compressionCodec": "ZSTANDARD", "rawIndexWriterVersion": 6, "targetMaxChunkSize": "1MB", "targetDocsPerChunk": -1 } } } ] }rawIndexWriterVersion: 6selects this writer.targetMaxChunkSizeis the chunk byte budget. The effective chunk size ismax(min(maxLength × targetDocsPerChunk, targetMaxChunkSize), 4KB), sotargetDocsPerChunk: -1makes the chunk size driven purely bytargetMaxChunkSize(otherwise a narrow column is capped at
maxLength × targetDocsPerChunk, e.g.~32 KB for a 32-byte column at the default
1000).Note: the table-config
targetDocsPerChunkderives the chunk byte size; the newconstructor parameter
targetDocsPerChunk(a hard per-chunk doc cap) is awriter-level API and is not wired into table config in this PR.
Backward compatibility
-1reproduces the exact current behavior.6) are unchanged; the target chunk sizeremains self-describing in the file header, so existing and new indexes stay
mutually readable.
Testing
VarByteChunkV6Test#testTargetDocsPerChunkCapsChunkasserts each capped chunkholds exactly
targetDocsPerChunkdocs and that values round-trip.-1default path).🤖 Generated with Claude Code