Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6 by xiangfu0 · Pull Request #18772 · apache/pinot

xiangfu0 · 2026-06-16T00:29:59Z

Summary

VarByteChunkForwardIndexWriterV6 currently closes a chunk only when the next
entry would overflow the chunkSize-byte buffer. This PR adds an optional
targetDocsPerChunk parameter so a chunk can additionally be bounded by document
count, letting callers control chunk granularity independently of the byte budget.

New 4-arg constructor VarByteChunkForwardIndexWriterV6(file, compressionType, chunkSize, targetDocsPerChunk). The existing 3-arg constructor delegates with
targetDocsPerChunk = -1 (DISABLE_DOCS_PER_CHUNK).
When targetDocsPerChunk > 0, a chunk is flushed once it holds that many docs,
even if the byte buffer isn't full; otherwise behavior is unchanged.
The buffer-overflow flush predicate is extracted into a protected
shouldStartNewChunk(int) hook in VarByteChunkForwardIndexWriterV4 (mirroring
the existing writeChunkHeader hook), so V6 adds the cap without duplicating
putBytes().

Motivation

For raw string/bytes columns, the ZSTD compression ratio depends heavily on how
many repeated values fall within a single chunk (the dedup window). Being able to
bound a chunk by document count — not only by bytes — gives finer control over the
size/granularity tradeoff for repetitive columns.

Constructor examples

// Default (3-arg): byte-driven flush only — identical to the prior behavior.
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize);

// Cap each chunk at 1000 docs: flush at 1000 docs OR when the chunkSize buffer
// fills, whichever comes first.
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize, 1000);

// Explicitly disable the docs cap (equivalent to the 3-arg constructor).
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize,
    VarByteChunkForwardIndexWriterV6.DISABLE_DOCS_PER_CHUNK);

Configuring V6 with a chunk size (table config)

Select this writer and the chunk size per column via fieldConfigList:

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "ZSTANDARD",
          "rawIndexWriterVersion": 6,
          "targetMaxChunkSize": "1MB",
          "targetDocsPerChunk": -1
        }
      }
    }
  ]
}

rawIndexWriterVersion: 6 selects this writer.
targetMaxChunkSize is the chunk byte budget. The effective chunk size is
max(min(maxLength × targetDocsPerChunk, targetMaxChunkSize), 4KB), so
targetDocsPerChunk: -1 makes the chunk size driven purely by targetMaxChunkSize
(otherwise a narrow column is capped at maxLength × targetDocsPerChunk, e.g.
~32 KB for a 32-byte column at the default 1000).

Note: the table-config targetDocsPerChunk derives the chunk byte size; the new
constructor parameter targetDocsPerChunk (a hard per-chunk doc cap) is a
writer-level API and is not wired into table config in this PR.

Backward compatibility

The default -1 reproduces the exact current behavior.
The on-disk format and writer version (6) are unchanged; the target chunk size
remains self-describing in the file header, so existing and new indexes stay
mutually readable.

Testing

New VarByteChunkV6Test#testTargetDocsPerChunkCapsChunk asserts each capped chunk
holds exactly targetDocsPerChunk docs and that values round-trip.
All inherited V4/V5/V6 read/write tests pass (the -1 default path).

🤖 Generated with Claude Code

V6 currently flushes a chunk only when the next entry would overflow the chunkSize-byte buffer. This adds an optional targetDocsPerChunk parameter so a chunk can also be bounded by document count, letting callers control chunk granularity independently of the byte budget. The cap defaults to -1 (DISABLE_DOCS_PER_CHUNK), preserving the existing purely byte-driven behavior; the on-disk format and version (6) are unchanged. The buffer-overflow flush predicate is extracted into a protected shouldStartNewChunk() hook in V4 (mirroring the existing writeChunkHeader hook) so V6 adds the docs cap without duplicating putBytes(). A unit test verifies each capped chunk holds exactly targetDocsPerChunk docs and round-trips. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-16T01:22:10Z

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.80%. Comparing base (54e5cd6) to head (99a96e1).
⚠️ Report is 27 commits behind head on master.

Files with missing lines	Patch %	Lines
.../writer/impl/VarByteChunkForwardIndexWriterV6.java	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18772      +/-   ##
============================================
+ Coverage     64.79%   64.80%   +0.01%     
  Complexity     1309     1309              
============================================
  Files          3380     3386       +6     
  Lines        209553   210148     +595     
  Branches      32799    32925     +126     
============================================
+ Hits         135773   136191     +418     
- Misses        62851    62992     +141     
- Partials      10929    10965      +36

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.80% <88.88%> (+0.01%)`	⬆️
temurin	`64.80% <88.88%> (+0.01%)`	⬆️
unittests	`64.80% <88.88%> (+0.01%)`	⬆️
unittests1	`56.99% <22.22%> (+<0.01%)`	⬆️
unittests2	`37.26% <88.88%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot

Pull request overview

Adds an optional document-count cap for chunk flushing in VarByteChunkForwardIndexWriterV6, allowing callers to control chunk granularity by both byte budget (chunkSize) and a new targetDocsPerChunk parameter, while keeping the on-disk format/version unchanged.

Changes:

Introduces a protected shouldStartNewChunk(int sizeRequired) hook (and a helper for current-chunk doc count) in V4 so subclasses can add extra chunk-flush predicates without duplicating putBytes().
Extends V6 with a 4-arg constructor and a docs-per-chunk cap that triggers chunk flushes once the current chunk reaches the configured doc count.
Adds a V6 unit test intended to validate docs-per-chunk behavior and round-trip correctness.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java	Extracts the “start new chunk” predicate into an overridable hook and exposes current-chunk doc count for subclasses.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java	Adds optional `targetDocsPerChunk` support via `shouldStartNewChunk` override and a new constructor overload.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java	Adds a test for docs-per-chunk capping and value round-tripping.

…counts - Fail fast in the constructor when targetDocsPerChunk is neither -1 (disabled) nor positive, instead of silently treating 0/other negatives as disabled. - Strengthen the test to verify per-chunk document counts (not just the chunk count) by reading each chunk's first docId from the metadata and asserting it equals chunk * targetDocsPerChunk; use 1000 docs with a cap of 30 so the last chunk holds a remainder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

xiangfu0 requested review from Jackie-Jiang and Copilot June 17, 2026 07:07

xiangfu0 added the index Related to indexing (general) label Jun 17, 2026

Copilot started reviewing on behalf of xiangfu0 June 17, 2026 07:07 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread ...ain/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java

Comment thread ...l/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772

Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772
xiangfu0 wants to merge 2 commits into
apache:masterfrom
xiangfu0:varbyte-v6-target-docs-per-chunk

xiangfu0 commented Jun 16, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xiangfu0 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Constructor examples

Configuring V6 with a chunk size (table config)

Backward compatibility

Testing

Uh oh!

codecov-commenter commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xiangfu0 commented Jun 16, 2026 •

edited

Loading

codecov-commenter commented Jun 16, 2026 •

edited

Loading