Skip to content

Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772

Open
xiangfu0 wants to merge 2 commits into
apache:masterfrom
xiangfu0:varbyte-v6-target-docs-per-chunk
Open

Add optional targetDocsPerChunk cap to VarByteChunkForwardIndexWriterV6#18772
xiangfu0 wants to merge 2 commits into
apache:masterfrom
xiangfu0:varbyte-v6-target-docs-per-chunk

Conversation

@xiangfu0

@xiangfu0 xiangfu0 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

VarByteChunkForwardIndexWriterV6 currently closes a chunk only when the next
entry would overflow the chunkSize-byte buffer. This PR adds an optional
targetDocsPerChunk parameter so a chunk can additionally be bounded by document
count, letting callers control chunk granularity independently of the byte budget.

  • New 4-arg constructor VarByteChunkForwardIndexWriterV6(file, compressionType, chunkSize, targetDocsPerChunk). The existing 3-arg constructor delegates with
    targetDocsPerChunk = -1 (DISABLE_DOCS_PER_CHUNK).
  • When targetDocsPerChunk > 0, a chunk is flushed once it holds that many docs,
    even if the byte buffer isn't full; otherwise behavior is unchanged.
  • The buffer-overflow flush predicate is extracted into a protected
    shouldStartNewChunk(int) hook in VarByteChunkForwardIndexWriterV4 (mirroring
    the existing writeChunkHeader hook), so V6 adds the cap without duplicating
    putBytes().

Motivation

For raw string/bytes columns, the ZSTD compression ratio depends heavily on how
many repeated values fall within a single chunk (the dedup window). Being able to
bound a chunk by document count — not only by bytes — gives finer control over the
size/granularity tradeoff for repetitive columns.

Constructor examples

// Default (3-arg): byte-driven flush only — identical to the prior behavior.
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize);

// Cap each chunk at 1000 docs: flush at 1000 docs OR when the chunkSize buffer
// fills, whichever comes first.
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize, 1000);

// Explicitly disable the docs cap (equivalent to the 3-arg constructor).
new VarByteChunkForwardIndexWriterV6(file, ChunkCompressionType.ZSTANDARD, chunkSize,
    VarByteChunkForwardIndexWriterV6.DISABLE_DOCS_PER_CHUNK);

Configuring V6 with a chunk size (table config)

Select this writer and the chunk size per column via fieldConfigList:

{
  "fieldConfigList": [
    {
      "name": "myColumn",
      "encodingType": "RAW",
      "indexes": {
        "forward": {
          "compressionCodec": "ZSTANDARD",
          "rawIndexWriterVersion": 6,
          "targetMaxChunkSize": "1MB",
          "targetDocsPerChunk": -1
        }
      }
    }
  ]
}
  • rawIndexWriterVersion: 6 selects this writer.
  • targetMaxChunkSize is the chunk byte budget. The effective chunk size is
    max(min(maxLength × targetDocsPerChunk, targetMaxChunkSize), 4KB), so
    targetDocsPerChunk: -1 makes the chunk size driven purely by targetMaxChunkSize
    (otherwise a narrow column is capped at maxLength × targetDocsPerChunk, e.g.
    ~32 KB for a 32-byte column at the default 1000).

Note: the table-config targetDocsPerChunk derives the chunk byte size; the new
constructor parameter targetDocsPerChunk (a hard per-chunk doc cap) is a
writer-level API and is not wired into table config in this PR.

Backward compatibility

  • The default -1 reproduces the exact current behavior.
  • The on-disk format and writer version (6) are unchanged; the target chunk size
    remains self-describing in the file header, so existing and new indexes stay
    mutually readable.

Testing

  • New VarByteChunkV6Test#testTargetDocsPerChunkCapsChunk asserts each capped chunk
    holds exactly targetDocsPerChunk docs and that values round-trip.
  • All inherited V4/V5/V6 read/write tests pass (the -1 default path).

🤖 Generated with Claude Code

V6 currently flushes a chunk only when the next entry would overflow the
chunkSize-byte buffer. This adds an optional targetDocsPerChunk parameter so a
chunk can also be bounded by document count, letting callers control chunk
granularity independently of the byte budget. The cap defaults to -1
(DISABLE_DOCS_PER_CHUNK), preserving the existing purely byte-driven behavior;
the on-disk format and version (6) are unchanged.

The buffer-overflow flush predicate is extracted into a protected
shouldStartNewChunk() hook in V4 (mirroring the existing writeChunkHeader hook)
so V6 adds the docs cap without duplicating putBytes(). A unit test verifies
each capped chunk holds exactly targetDocsPerChunk docs and round-trips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.80%. Comparing base (54e5cd6) to head (99a96e1).
⚠️ Report is 27 commits behind head on master.

Files with missing lines Patch % Lines
.../writer/impl/VarByteChunkForwardIndexWriterV6.java 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18772      +/-   ##
============================================
+ Coverage     64.79%   64.80%   +0.01%     
  Complexity     1309     1309              
============================================
  Files          3380     3386       +6     
  Lines        209553   210148     +595     
  Branches      32799    32925     +126     
============================================
+ Hits         135773   136191     +418     
- Misses        62851    62992     +141     
- Partials      10929    10965      +36     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.80% <88.88%> (+0.01%) ⬆️
temurin 64.80% <88.88%> (+0.01%) ⬆️
unittests 64.80% <88.88%> (+0.01%) ⬆️
unittests1 56.99% <22.22%> (+<0.01%) ⬆️
unittests2 37.26% <88.88%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 requested review from Jackie-Jiang and Copilot June 17, 2026 07:07
@xiangfu0 xiangfu0 added the index Related to indexing (general) label Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional document-count cap for chunk flushing in VarByteChunkForwardIndexWriterV6, allowing callers to control chunk granularity by both byte budget (chunkSize) and a new targetDocsPerChunk parameter, while keeping the on-disk format/version unchanged.

Changes:

  • Introduces a protected shouldStartNewChunk(int sizeRequired) hook (and a helper for current-chunk doc count) in V4 so subclasses can add extra chunk-flush predicates without duplicating putBytes().
  • Extends V6 with a 4-arg constructor and a docs-per-chunk cap that triggers chunk flushes once the current chunk reaches the configured doc count.
  • Adds a V6 unit test intended to validate docs-per-chunk behavior and round-trip correctness.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java Extracts the “start new chunk” predicate into an overridable hook and exposes current-chunk doc count for subclasses.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java Adds optional targetDocsPerChunk support via shouldStartNewChunk override and a new constructor overload.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java Adds a test for docs-per-chunk capping and value round-tripping.

…counts

- Fail fast in the constructor when targetDocsPerChunk is neither -1 (disabled)
  nor positive, instead of silently treating 0/other negatives as disabled.
- Strengthen the test to verify per-chunk document counts (not just the chunk
  count) by reading each chunk's first docId from the metadata and asserting it
  equals chunk * targetDocsPerChunk; use 1000 docs with a cap of 30 so the last
  chunk holds a remainder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

index Related to indexing (general)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants