[Arrow] Typed zero-boxing column-major build for single-value primitive columns by real-mj-song · Pull Request #18797 · apache/pinot

real-mj-song · 2026-06-18T05:59:43Z

TL;DR: The column-major build added in #18638 removed per-row GenericRow materialization, but it still consumes every value through the generic Object path — so each primitive is boxed twice, once on the stats pass and once on the index pass. This PR adds a typed, boxing-free fast path for single-value INT/LONG/FLOAT/DOUBLE columns: ColumnReader.nextInt() → collect(int) / indexOfSV(int) → IndexCreator.addInt(value, dictId). Purely additive; nulls, multi-value, and every other type fall back to the unchanged Object path. No SPI changes, no new public API.

Tracks #18629. Builds on #18638 (column-major build) and the ColumnReader / ColumnReaderFactory SPI (#16727).

Problem

#18629 set out to build a segment from a columnar source without "the per-row GenericRow allocation and per-primitive boxing." #18638 delivered the column-major path and removed the GenericRow allocation, but it routes values through the generic Object API:

columnReader.next()                         // Object — boxes the primitive
  -> ColumnarValueNormalizer.normalize(..)  // Object
  -> dictionaryCreator.indexOfSV(Object)    // boxed dictionary lookup
  -> creator.add(Object, dictId)            // unboxes again to write

So the per-primitive boxing #18629 aimed to remove is still on the hot path — and twice over, since buildColumnar() reads every column once for stats and once for indexing. (This is the follow-up requested in the #18638 review.)

What this PR does

The ColumnReader SPI (#16727) already exposes typed accessors (nextInt(), isInt(), …) and the downstream sinks already have primitive overloads — AbstractColumnStatisticsCollector.collect(int), SegmentDictionaryCreator.indexOfSV(int), and addInt(value, dictId) on ForwardIndexCreator / CombinedInvertedIndexCreator / DictionaryBasedInvertedIndexCreator. They were simply never wired to the column-major build. This PR wires them:

- Stats pass (ColumnarSegmentPreIndexStatsContainer): when the reader can serve a single-value column as a primitive (isInt()/isLong()/…), read with nextInt()/… and feed collect(int)/… directly.
- Index pass (SegmentColumnarIndexCreator): the same dispatch, nextInt() → indexOfSV(int) → creator.addInt(value, dictId), applied across the column's index creators. Creators with no typed override keep the boxing default and stay correct.
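The per-value dispatch described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Pinot's code: `IntColumnReader`, `IntSink`, `ArrayReader`, and `CountingSink` are hypothetical stand-ins that only borrow method names mentioned in this description (nextInt(), isNextNull(), next(), collect(int)). The real ColumnReader and collector types are larger and may differ in signature.

```java
/** Hypothetical stand-ins for illustration; these are not Pinot's interfaces. */
final class TypedDispatchSketch {

  /** Minimal reader shape: a typed accessor next to the generic one. */
  interface IntColumnReader {
    boolean hasNext();
    boolean isNextNull();
    int nextInt();   // typed read, returns a primitive
    Object next();   // generic read, returns a boxed value or null
  }

  /** Minimal sink with a primitive overload and an Object overload. */
  interface IntSink {
    void collect(int value);
    void collect(Object value);
  }

  /** Drains the reader and returns how many values took the typed path. */
  static int drain(IntColumnReader reader, IntSink sink) {
    int typed = 0;
    while (reader.hasNext()) {
      if (reader.isNextNull()) {
        // Nulls stay on the Object path so null handling is unchanged.
        sink.collect(reader.next());
      } else {
        // Non-null values are read and collected as primitives.
        sink.collect(reader.nextInt());
        typed++;
      }
    }
    return typed;
  }

  /** Array-backed reader, assumed here only to make the sketch runnable. */
  static final class ArrayReader implements IntColumnReader {
    private final int[] _values;
    private final boolean[] _nulls;
    private int _pos;

    ArrayReader(int[] values, boolean[] nulls) {
      _values = values;
      _nulls = nulls;
    }

    public boolean hasNext() { return _pos < _values.length; }
    public boolean isNextNull() { return _nulls[_pos]; }
    public int nextInt() { return _values[_pos++]; }

    public Object next() {
      int i = _pos++;
      return _nulls[i] ? null : Integer.valueOf(_values[i]);
    }
  }

  /** Sink that records which overload was hit. */
  static final class CountingSink implements IntSink {
    long sum;
    int typedCalls;
    int objectCalls;

    public void collect(int value) {
      sum += value;
      typedCalls++;
    }

    public void collect(Object value) {
      objectCalls++;
      if (value != null) {
        sum += (Integer) value;
      }
    }
  }

  public static void main(String[] args) {
    CountingSink sink = new CountingSink();
    int typed = drain(
        new ArrayReader(new int[] {1, 0, 3}, new boolean[] {false, true, false}), sink);
    System.out.println("typed=" + typed + " object=" + sink.objectCalls + " sum=" + sink.sum);
  }
}
```

Overload resolution does the routing: `reader.nextInt()` has static type `int` and binds to `collect(int)`, while `reader.next()` has static type `Object` and binds to `collect(Object)`.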

Key choices:

- Gated on the logical FieldSpec type (INT/LONG/FLOAT/DOUBLE), so BOOLEAN (stored INT) and TIMESTAMP (stored LONG), which require value coercion the typed read would skip, correctly stay on the Object path.
- Null docs route through the existing Object path for byte-for-byte parity, including null-value-vector marking and default substitution. The typed primitive accessor is only called when isNextNull() is false.
- Multi-value and non-primitive types are unchanged.
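The column-level gate in the first and third choices can be illustrated with a small sketch. The `LogicalType` enum and the method names below are hypothetical and exist only for this example (Pinot's own type lives on FieldSpec). The stored-type mapping for BOOLEAN and TIMESTAMP is taken from the description above.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical gate for illustration; not Pinot's actual types or method names. */
final class TypedPathGate {

  /** Assumed subset of logical column types, with the stored type each one maps to. */
  enum LogicalType {
    INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING;

    LogicalType storedType() {
      switch (this) {
        case BOOLEAN:
          return INT;    // BOOLEAN is stored as INT
        case TIMESTAMP:
          return LONG;   // TIMESTAMP is stored as LONG
        default:
          return this;
      }
    }
  }

  private static final Set<LogicalType> TYPED_FAST_PATH =
      EnumSet.of(LogicalType.INT, LogicalType.LONG, LogicalType.FLOAT, LogicalType.DOUBLE);

  /** Gate on the logical type: only single-value INT/LONG/FLOAT/DOUBLE qualify. */
  static boolean useTypedPath(LogicalType logicalType, boolean singleValue) {
    return singleValue && TYPED_FAST_PATH.contains(logicalType);
  }

  /** Counter-example: gating on the stored type would also admit BOOLEAN and TIMESTAMP. */
  static boolean wouldPassStoredTypeGate(LogicalType logicalType, boolean singleValue) {
    return singleValue && TYPED_FAST_PATH.contains(logicalType.storedType());
  }

  public static void main(String[] args) {
    for (LogicalType type : LogicalType.values()) {
      System.out.println(type + ": logical gate=" + useTypedPath(type, true)
          + ", stored gate=" + wouldPassStoredTypeGate(type, true));
    }
  }
}
```

Gating on the stored type would send BOOLEAN and TIMESTAMP columns down the typed read and skip their coercion, which is why the check is written against the logical type.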

Net effect: for non-null values in single-value INT/LONG/FLOAT/DOUBLE columns, neither build pass boxes, which removes the remaining per-primitive boxing on both column reads.

Potential saving

Each column is read twice (stats pass + index pass), so the Object path boxes every value twice: 2 × N boxing operations for an N-row single-value primitive column. The typed path avoids, per such column:

| Type | Garbage avoided |
|---|---|
| `INT` / `FLOAT` | `2 × N × 16 B` |
| `LONG` / `DOUBLE` | `2 × N × 24 B` |

Wrapper sizes assume a 64-bit JVM with compressed oops. Integer and Long values in −128..127 come from the boxing cache and allocate nothing, so the figures are upper bounds for columns with many small values. For example, a 1M-row column avoids up to ~32 MB (INT / FLOAT) or ~48 MB (LONG / DOUBLE) of short-lived allocation, plus 2 × N boxing operations. This reduces allocation and GC pressure only; retained heap and segment size are unchanged.
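The arithmetic behind these figures can be checked with a few lines. The class and constant names are made up for this example; the 16 B and 24 B wrapper sizes are the ones assumed in the table above.

```java
/** Back-of-the-envelope estimate of wrapper garbage avoided; names are illustrative only. */
final class BoxingSavingsEstimate {
  /** Stats pass plus index pass. */
  static final int PASSES = 2;
  /** Assumed Integer / Float size on a 64-bit JVM with compressed oops. */
  static final int INT_OR_FLOAT_WRAPPER_BYTES = 16;
  /** Assumed Long / Double size on a 64-bit JVM with compressed oops. */
  static final int LONG_OR_DOUBLE_WRAPPER_BYTES = 24;

  /** Upper bound on wrapper bytes avoided for one column: passes x rows x wrapper size. */
  static long bytesAvoided(long rows, int wrapperBytes) {
    return PASSES * rows * wrapperBytes;
  }

  public static void main(String[] args) {
    long rows = 1_000_000L;
    System.out.println("INT/FLOAT:   " + bytesAvoided(rows, INT_OR_FLOAT_WRAPPER_BYTES) + " bytes");
    System.out.println("LONG/DOUBLE: " + bytesAvoided(rows, LONG_OR_DOUBLE_WRAPPER_BYTES) + " bytes");
  }
}
```

For 1,000,000 rows this gives 32,000,000 and 48,000,000 bytes, matching the ~32 MB and ~48 MB quoted above in decimal megabytes.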

Scope

Single-value INT/LONG/FLOAT/DOUBLE only. The typed multi-value sinks (addIntMV, nextIntMV / MultiValueResult) already exist but are intentionally not wired here — the multi-value element-null contract is a distinct concern and is better handled on its own.

Compatibility

Additive and behavior-preserving: no SPI changes, no new public API; the row-major path and the Object-path column-major fallback are untouched. The only non-fast-path edit documents the existing two-pass rewind() precondition on the init(config, ColumnReaderFactory) overload (the consumer that imposes it), leaving the ColumnReaderFactory SPI itself generic.

Testing

This is a behavior-preserving (output-identical) de-box, already covered by the existing column-major equivalence suites — no new tests are needed:

- ColumnarRowMajorEquivalenceTest builds a segment column-major vs. row-major over single-value INT/LONG/FLOAT/DOUBLE columns with ~10% nulls and asserts per-doc value and min/max/cardinality equality, which exercises the typed dispatch and its null branch.
- The pinot-arrow column-major suite (e.g. testRichMultiBatchEquivalence, INT with nulls across batches) exercises the same typed path via an Arrow source.

References

Tracks #18629 · builds on #18638 · ColumnReader SPI #16727

codecov-commenter · 2026-06-18T06:45:43Z

Codecov Report

❌ Patch coverage is 52.04082% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.78%. Comparing base (5bf9e68) to head (3368c9d).
⚠️ Report is 12 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ment/creator/impl/SegmentColumnarIndexCreator.java | 54.83% | 16 Missing and 12 partials ⚠️ |
| ...l/stats/ColumnarSegmentPreIndexStatsContainer.java | 47.22% | 11 Missing and 8 partials ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18797      +/-   ##
============================================
- Coverage     64.78%   64.78%   -0.01%     
  Complexity     1309     1309              
============================================
  Files          3381     3386       +5     
  Lines        209949   210259     +310     
  Branches      32887    32960      +73     
============================================
+ Hits         136013   136213     +200     
- Misses        62977    63061      +84     
- Partials      10959    10985      +26

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.78% <52.04%> (-0.01%)`	⬇️
temurin	`64.78% <52.04%> (-0.01%)`	⬇️
unittests	`64.78% <52.04%> (-0.01%)`	⬇️
unittests1	`56.92% <0.00%> (-0.04%)`	⬇️
unittests2	`37.26% <52.04%> (+<0.01%)`	⬆️

…ve columns The column-major build added in apache#18638 removed per-row GenericRow materialization but still consumes each value through the generic Object path (next() -> collect(Object) / indexOfSV(Object) / add(Object, dictId)), so every primitive is boxed on both the stats pass and the index pass. Wire the already-existing typed accessors and primitive sinks together for single-value INT/LONG/FLOAT/DOUBLE columns: read via ColumnReader.nextInt()/... (gated on isInt()/isLong()/...) and feed collect(int), indexOfSV(int), and IndexCreator.addInt(value, dictId) directly. Nulls route through the unchanged Object path for exact parity (including null-value-vector marking and default substitution); multi-value, BOOLEAN/TIMESTAMP, and every other type fall back unchanged. Also document the two-pass rewind/memory precondition on the init(config, ColumnReaderFactory) overload. Completes the per-primitive boxing removal tracked in apache#18629.

real-mj-song force-pushed the arrow-columnar-typed branch from 845a0ea to 3368c9d Compare June 18, 2026 21:25

real-mj-song wants to merge 1 commit into apache:master from real-mj-song:arrow-columnar-typed
