
[Arrow] Typed zero-boxing column-major build for single-value primitive columns #18797

Open
real-mj-song wants to merge 1 commit into apache:master from real-mj-song:arrow-columnar-typed

@real-mj-song real-mj-song commented Jun 18, 2026

Contributor

TL;DR: The column-major build added in #18638 removed per-row GenericRow materialization, but it still consumes every value through the generic Object path, so each primitive is boxed twice: once on the stats pass and once on the index pass. This PR adds a typed, boxing-free fast path for single-value INT/LONG/FLOAT/DOUBLE columns: ColumnReader.nextInt() → collect(int) / indexOfSV(int) → IndexCreator.addInt(value, dictId). Purely additive; nulls, multi-value, and every other type fall back to the unchanged Object path. No SPI changes, no new public API.

Tracks #18629. Builds on #18638 (column-major build) and the ColumnReader / ColumnReaderFactory SPI (#16727).

Problem

#18629 set out to build a segment from a columnar source without "the per-row GenericRow allocation and per-primitive boxing." #18638 delivered the column-major path and removed the GenericRow allocation, but it routes values through the generic Object API:

columnReader.next()                         // Object — boxes the primitive
  -> ColumnarValueNormalizer.normalize(..)  // Object
  -> dictionaryCreator.indexOfSV(Object)    // boxed dictionary lookup
  -> creator.add(Object, dictId)            // unboxes again to write

So the per-primitive boxing #18629 aimed to remove is still on the hot path — and twice over, since buildColumnar() reads every column once for stats and once for indexing. (This is the follow-up requested in the #18638 review.)
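To make the "twice over" cost concrete, here is a minimal sketch that counts box conversions when an int column is read through an Object-returning accessor on two passes. `ObjectReader`, its `boxes` counter, and `hasNext()` are hypothetical stand-ins written for this illustration; they are not Pinot's ColumnReader.

```java
final class BoxingCountSketch {
    /** Hypothetical Object-typed reader over an int column; counts each box conversion. */
    static final class ObjectReader {
        private final int[] values;
        private int pos;
        long boxes; // number of int -> Integer conversions performed

        ObjectReader(int[] values) { this.values = values; }

        boolean hasNext() { return pos < values.length; }

        Object next() {
            boxes++;
            return values[pos++]; // autoboxing: int -> Integer
        }

        void rewind() { pos = 0; }
    }

    /** Runs a stats pass and an index pass over one reader; returns the box conversions made. */
    static long twoPassBoxes(int[] column) {
        ObjectReader reader = new ObjectReader(column);

        int min = Integer.MAX_VALUE;
        while (reader.hasNext()) {             // pass 1: statistics (unboxes to compare)
            min = Math.min(min, (Integer) reader.next());
        }

        reader.rewind();
        long checksum = 0;
        while (reader.hasNext()) {             // pass 2: stand-in for index writing
            checksum += (Integer) reader.next();
        }
        return reader.boxes;
    }

    public static void main(String[] args) {
        int[] column = new int[1_000];
        for (int i = 0; i < column.length; i++) {
            column[i] = i * 7;
        }
        System.out.println("box conversions: " + twoPassBoxes(column));
    }
}
```

The counter tracks conversions rather than allocations: small values may be served from the JVM's wrapper cache, so the allocation count can be lower than the conversion count.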

What this PR does

The ColumnReader SPI (#16727) already exposes typed accessors (nextInt(), isInt(), …) and the downstream sinks already have primitive overloads — AbstractColumnStatisticsCollector.collect(int), SegmentDictionaryCreator.indexOfSV(int), and addInt(value, dictId) on ForwardIndexCreator / CombinedInvertedIndexCreator / DictionaryBasedInvertedIndexCreator. They were simply never wired to the column-major build. This PR wires them:

  • Stats pass (ColumnarSegmentPreIndexStatsContainer): when the reader can serve a single-value column as a primitive (isInt()/isLong()/…), read with nextInt()/… and feed collect(int)/… directly.
  • Index pass (SegmentColumnarIndexCreator): same dispatch, nextInt() → indexOfSV(int) → creator.addInt(value, dictId), across the column's index creators (creators with no typed override keep the boxing default and stay correct).
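The two passes above can be sketched roughly as follows for an INT column. Every class here (`IntReader`, `IntStats`, `IntDictionary`, `ForwardIndex`, `Result`) is a simplified, hypothetical stand-in that only borrows the method names mentioned in this PR; the real Pinot classes have different constructors and more responsibilities, and the dictionary here is built straight from the column for brevity.

```java
import java.util.Arrays;

final class TypedDispatchSketch {
    /** Hypothetical reader with a typed accessor. */
    static final class IntReader {
        private final int[] values;
        private int pos;
        IntReader(int[] values) { this.values = values; }
        boolean isInt() { return true; }
        boolean hasNext() { return pos < values.length; }
        int nextInt() { return values[pos++]; }
        void rewind() { pos = 0; }
    }

    /** Hypothetical stats sink with a primitive overload. */
    static final class IntStats {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        void collect(int v) { min = Math.min(min, v); max = Math.max(max, v); }
    }

    /** Hypothetical sorted dictionary: dictId is the value's rank among distinct values. */
    static final class IntDictionary {
        private final int[] sorted;
        IntDictionary(int[] values) { sorted = Arrays.stream(values).sorted().distinct().toArray(); }
        int indexOfSV(int v) { return Arrays.binarySearch(sorted, v); }
    }

    /** Hypothetical forward index storing one dictId per doc. */
    static final class ForwardIndex {
        final int[] dictIds;
        private int doc;
        ForwardIndex(int numDocs) { dictIds = new int[numDocs]; }
        void addInt(int value, int dictId) { dictIds[doc++] = dictId; }
    }

    static final class Result {
        final int min, max;
        final int[] dictIds;
        Result(int min, int max, int[] dictIds) { this.min = min; this.max = max; this.dictIds = dictIds; }
    }

    static Result build(int[] column) {
        IntReader reader = new IntReader(column);
        if (!reader.isInt()) {
            throw new IllegalStateException("sketch covers the typed INT path only");
        }
        IntStats stats = new IntStats();
        while (reader.hasNext()) {              // stats pass: no Integer is created
            stats.collect(reader.nextInt());
        }
        reader.rewind();                        // second pass over the same column
        IntDictionary dictionary = new IntDictionary(column);
        ForwardIndex forwardIndex = new ForwardIndex(column.length);
        while (reader.hasNext()) {              // index pass: primitive lookup and write
            int value = reader.nextInt();
            forwardIndex.addInt(value, dictionary.indexOfSV(value));
        }
        return new Result(stats.min, stats.max, forwardIndex.dictIds);
    }

    public static void main(String[] args) {
        Result r = build(new int[] {30, 10, 30, 20});
        System.out.println(r.min + " " + r.max + " " + Arrays.toString(r.dictIds)); // 10 30 [2, 0, 2, 1]
    }
}
```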

Key choices:

  • Gated on the logical FieldSpec type (INT/LONG/FLOAT/DOUBLE), so BOOLEAN (stored INT) and TIMESTAMP (stored LONG) — which require value coercion the typed read would skip — correctly stay on the Object path.
  • Null docs route through the existing Object path for byte-for-byte parity, including null-value-vector marking and default substitution. The typed primitive accessor is only called when isNextNull() is false.
  • Multi-value and non-primitive types are unchanged.
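As a sketch of these choices, the gate and the per-document null routing might look like the code below. `LogicalType`, `usesTypedPath`, `NullableIntReader`, and the `defaultNullValue` parameter are hypothetical names for this illustration; the actual gating in the PR reads the column's FieldSpec, and the real Object path does more than substitute a default.

```java
import java.util.BitSet;

final class NullRoutingSketch {
    /** Hypothetical subset of logical column types. */
    enum LogicalType { INT, LONG, FLOAT, DOUBLE, BOOLEAN, TIMESTAMP, STRING }

    /** Typed path only for single-value columns whose logical type is a plain numeric primitive. */
    static boolean usesTypedPath(LogicalType type, boolean singleValue) {
        if (!singleValue) {
            return false;
        }
        switch (type) {
            case INT:
            case LONG:
            case FLOAT:
            case DOUBLE:
                return true;
            default:
                return false; // BOOLEAN, TIMESTAMP, STRING stay on the Object path
        }
    }

    /** Hypothetical nullable reader: values plus a validity flag per doc. */
    static final class NullableIntReader {
        private final int[] values;
        private final boolean[] isNull;
        private int pos;

        NullableIntReader(int[] values, boolean[] isNull) {
            this.values = values;
            this.isNull = isNull;
        }

        boolean hasNext() { return pos < values.length; }
        boolean isNextNull() { return isNull[pos]; }

        /** Typed accessor; callers must check isNextNull() first. */
        int nextInt() { return values[pos++]; }

        /** Object accessor; returns null for null docs. */
        Object next() {
            Object v = isNull[pos] ? null : Integer.valueOf(values[pos]);
            pos++;
            return v;
        }
    }

    /** Index pass: null docs take the Object path, all other docs take the typed path. */
    static int[] index(NullableIntReader reader, int numDocs, int defaultNullValue, BitSet nullDocs) {
        int[] out = new int[numDocs];
        for (int doc = 0; reader.hasNext(); doc++) {
            if (reader.isNextNull()) {
                Object v = reader.next();                       // Object path
                out[doc] = (v == null) ? defaultNullValue : (Integer) v;
                nullDocs.set(doc);                              // mark the null-value vector
            } else {
                out[doc] = reader.nextInt();                    // typed path, no boxing
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet nullDocs = new BitSet();
        NullableIntReader reader =
            new NullableIntReader(new int[] {5, 0, 7}, new boolean[] {false, true, false});
        int[] out = index(reader, 3, -1, nullDocs);
        System.out.println(java.util.Arrays.toString(out) + " nulls=" + nullDocs);
    }
}
```

Because only non-null documents reach `nextInt()`, null handling stays in one place, which is what keeps the output identical to the Object path.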

Net effect: for single-value primitive columns, neither build pass boxes — removing the remaining per-primitive boxing on both column reads.

Potential saving

Each column is read twice (stats pass + index pass), so the Object path boxes every single-value primitive value 2 × N times for an N-row column. The typed path avoids, per such column:

| Type | Garbage avoided |
| --- | --- |
| INT / FLOAT | 2 × N × 16 B |
| LONG / DOUBLE | 2 × N × 24 B |

Wrapper sizes assume a 64-bit JVM with compressed oops; the estimate ignores the Integer / Long cache for −128..127, whose boxed values are not freshly allocated. For example, a 1M-row column avoids ~32 MB (INT / FLOAT) or ~48 MB (LONG / DOUBLE) of short-lived allocation and 2 × N boxing operations. The saving is allocation/GC pressure only; retained heap and segment size are unchanged.
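The arithmetic behind those figures is small enough to check directly. The helper below is a hypothetical calculator for this estimate only; the 16 B and 24 B wrapper sizes are the assumptions stated above, not values measured by the code.

```java
final class BoxingGarbageEstimate {
    static final int SMALL_WRAPPER_BYTES = 16; // Integer / Float, assumed size
    static final int LARGE_WRAPPER_BYTES = 24; // Long / Double, assumed size

    /** Bytes of wrapper garbage avoided: two passes x rows x wrapper size. */
    static long bytesAvoided(long rows, int wrapperBytes) {
        return 2L * rows * wrapperBytes;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        System.out.println(bytesAvoided(rows, SMALL_WRAPPER_BYTES)); // 32000000
        System.out.println(bytesAvoided(rows, LARGE_WRAPPER_BYTES)); // 48000000
    }
}
```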

Scope

Single-value INT/LONG/FLOAT/DOUBLE only. The typed multi-value sinks (addIntMV, nextIntMV / MultiValueResult) already exist but are intentionally not wired here — the multi-value element-null contract is a distinct concern and is better handled on its own.

Compatibility

Additive and behavior-preserving: no SPI changes, no new public API; the row-major path and the Object-path column-major fallback are untouched. The only non-fast-path edit documents the existing two-pass rewind() precondition on the init(config, ColumnReaderFactory) overload (the consumer that imposes it), leaving the ColumnReaderFactory SPI itself generic.

Testing

This is a behavior-preserving (output-identical) de-box, already covered by the existing column-major equivalence suites — no new tests are needed:

  • ColumnarRowMajorEquivalenceTest builds a segment column-major vs. row-major over single-value INT/LONG/FLOAT/DOUBLE columns with ~10% nulls and asserts per-doc value and min/max/cardinality equality — exercising the typed dispatch and its null branch.
  • The pinot-arrow column-major suite (e.g. testRichMultiBatchEquivalence, INT with nulls across batches) exercises the same typed path via an Arrow source.

References

Tracks #18629 · builds on #18638 · ColumnReader SPI #16727

codecov-commenter commented Jun 18, 2026


Codecov Report

❌ Patch coverage is 52.04082% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.78%. Comparing base (5bf9e68) to head (3368c9d).
⚠️ Report is 12 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ment/creator/impl/SegmentColumnarIndexCreator.java | 54.83% | 16 Missing and 12 partials ⚠️ |
| ...l/stats/ColumnarSegmentPreIndexStatsContainer.java | 47.22% | 11 Missing and 8 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18797      +/-   ##
============================================
- Coverage     64.78%   64.78%   -0.01%     
  Complexity     1309     1309              
============================================
  Files          3381     3386       +5     
  Lines        209949   210259     +310     
  Branches      32887    32960      +73     
============================================
+ Hits         136013   136213     +200     
- Misses        62977    63061      +84     
- Partials      10959    10985      +26     

| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.78% <52.04%> (-0.01%) ⬇️ |
| temurin | 64.78% <52.04%> (-0.01%) ⬇️ |
| unittests | 64.78% <52.04%> (-0.01%) ⬇️ |
| unittests1 | 56.92% <0.00%> (-0.04%) ⬇️ |
| unittests2 | 37.26% <52.04%> (+<0.01%) ⬆️ |


…ve columns

The column-major build added in apache#18638 removed per-row GenericRow materialization
but still consumes each value through the generic Object path
(next() -> collect(Object) / indexOfSV(Object) / add(Object, dictId)), so every
primitive is boxed on both the stats pass and the index pass.

Wire the already-existing typed accessors and primitive sinks together for
single-value INT/LONG/FLOAT/DOUBLE columns: read via ColumnReader.nextInt()/...
(gated on isInt()/isLong()/...) and feed collect(int), indexOfSV(int), and
IndexCreator.addInt(value, dictId) directly. Nulls route through the unchanged
Object path for exact parity (including null-value-vector marking and default
substitution); multi-value, BOOLEAN/TIMESTAMP, and every other type fall back
unchanged. Also document the two-pass rewind/memory precondition on the
init(config, ColumnReaderFactory) overload.

Completes the per-primitive boxing removal tracked in apache#18629.
@real-mj-song real-mj-song force-pushed the arrow-columnar-typed branch from 845a0ea to 3368c9d Compare June 18, 2026 21:25