[Arrow] Typed zero-boxing column-major build for single-value primitive columns#18797
Open
real-mj-song wants to merge 1 commit into
Open
[Arrow] Typed zero-boxing column-major build for single-value primitive columns#18797real-mj-song wants to merge 1 commit into
real-mj-song wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18797 +/- ##
============================================
- Coverage 64.78% 64.78% -0.01%
Complexity 1309 1309
============================================
Files 3381 3386 +5
Lines 209949 210259 +310
Branches 32887 32960 +73
============================================
+ Hits 136013 136213 +200
- Misses 62977 63061 +84
- Partials 10959 10985 +26
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ve columns The column-major build added in apache#18638 removed per-row GenericRow materialization but still consumes each value through the generic Object path (next() -> collect(Object) / indexOfSV(Object) / add(Object, dictId)), so every primitive is boxed on both the stats pass and the index pass. Wire the already-existing typed accessors and primitive sinks together for single-value INT/LONG/FLOAT/DOUBLE columns: read via ColumnReader.nextInt()/... (gated on isInt()/isLong()/...) and feed collect(int), indexOfSV(int), and IndexCreator.addInt(value, dictId) directly. Nulls route through the unchanged Object path for exact parity (including null-value-vector marking and default substitution); multi-value, BOOLEAN/TIMESTAMP, and every other type fall back unchanged. Also document the two-pass rewind/memory precondition on the init(config, ColumnReaderFactory) overload. Completes the per-primitive boxing removal tracked in apache#18629.
845a0ea to
3368c9d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR: The column-major build added in #18638 removed per-row
GenericRowmaterialization, but it still consumes every value through the genericObjectpath — so each primitive is boxed twice, once on the stats pass and once on the index pass. This PR adds a typed, boxing-free fast path for single-valueINT/LONG/FLOAT/DOUBLEcolumns:ColumnReader.nextInt()→collect(int)/indexOfSV(int)→IndexCreator.addInt(value, dictId). Purely additive; nulls, multi-value, and every other type fall back to the unchanged Object path. No SPI changes, no new public API.Tracks #18629. Builds on #18638 (column-major build) and the
ColumnReader/ColumnReaderFactorySPI (#16727).Problem
#18629 set out to build a segment from a columnar source without "the per-row
GenericRowallocation and per-primitive boxing." #18638 delivered the column-major path and removed theGenericRowallocation, but it routes values through the genericObjectAPI:So the per-primitive boxing #18629 aimed to remove is still on the hot path — and twice over, since
buildColumnar()reads every column once for stats and once for indexing. (This is the follow-up requested in the #18638 review.)What this PR does
The
ColumnReaderSPI (#16727) already exposes typed accessors (nextInt(),isInt(), …) and the downstream sinks already have primitive overloads —AbstractColumnStatisticsCollector.collect(int),SegmentDictionaryCreator.indexOfSV(int), andaddInt(value, dictId)onForwardIndexCreator/CombinedInvertedIndexCreator/DictionaryBasedInvertedIndexCreator. They were simply never wired to the column-major build. This PR wires them:ColumnarSegmentPreIndexStatsContainer) — when the reader can serve a single-value column as a primitive (isInt()/isLong()/…), read withnextInt()/… and feedcollect(int)/… directly.SegmentColumnarIndexCreator) — same dispatch:nextInt()→indexOfSV(int)→creator.addInt(value, dictId)across the column's index creators (creators with no typed override keep the boxing default and stay correct).Key choices:
FieldSpectype (INT/LONG/FLOAT/DOUBLE), soBOOLEAN(storedINT) andTIMESTAMP(storedLONG) — which require value coercion the typed read would skip — correctly stay on the Object path.isNextNull()is false.Net effect: for single-value primitive columns, neither build pass boxes — removing the remaining per-primitive boxing on both column reads.
Potential saving
Each column is read twice (stats pass + index pass), so the Object path boxes every single-value primitive value
2 × Ntimes for anN-row column. The typed path avoids, per such column:INT/FLOAT2 × N × 16 BLONG/DOUBLE2 × N × 24 BWrapper sizes assume a 64-bit JVM with compressed oops (
Integer/Longskip the −128..127 cache). For example a 1M-row column eliminates ~32 MB (INT/FLOAT) or ~48 MB (LONG/DOUBLE) of short-lived allocation and2 × Nboxing ops — allocation/GC pressure only; retained heap and segment size are unchanged.Scope
Single-value
INT/LONG/FLOAT/DOUBLEonly. The typed multi-value sinks (addIntMV,nextIntMV/MultiValueResult) already exist but are intentionally not wired here — the multi-value element-null contract is a distinct concern and is better handled on its own.Compatibility
Additive and behavior-preserving: no SPI changes, no new public API; the row-major path and the Object-path column-major fallback are untouched. The only non-fast-path edit documents the existing two-pass
rewind()precondition on theinit(config, ColumnReaderFactory)overload (the consumer that imposes it), leaving theColumnReaderFactorySPI itself generic.Testing
This is a behavior-preserving (output-identical) de-box, already covered by the existing column-major equivalence suites — no new tests are needed:
ColumnarRowMajorEquivalenceTestbuilds a segment column-major vs. row-major over single-valueINT/LONG/FLOAT/DOUBLEcolumns with ~10% nulls and asserts per-doc value and min/max/cardinality equality — exercising the typed dispatch and its null branch.pinot-arrowcolumn-major suite (e.g.testRichMultiBatchEquivalence, INT with nulls across batches) exercises the same typed path via an Arrow source.References
Tracks #18629 · builds on #18638 ·
ColumnReaderSPI #16727