Parse-once for JSON-index ingestion: cache the parsed Map and flatten it directly #18756

Open
xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:st-json-index-cache
Conversation

@xiangfu0

Contributor

What

A dataType: JSON column with a JSON index pays a serialize → re-parse round-trip per row: DataTypeTransformer serializes the parsed Map to a string for the forward index, and the mutable JSON index then re-parses that same string (stringToJsonNode → flatten). For large documents (log messages with stack traces) the re-tokenization dominates ingestion CPU.

This caches the already-parsed Map on the GenericRow and feeds it to the JSON index, which flattens it directly:

  • JsonUtils.flattenParsed(Object) — flatten a parsed Map/List/JsonNode via valueToTree, skipping string tokenization.
  • MutableJsonIndex.addParsed(Object) + MutableJsonIndexImpl override — index the parsed value directly.
  • GenericRow — transient per-row parsed-value cache (cleared per row; not part of value/equality/copy/serialized state).
  • DataTypeTransformer — caches the Map only for JSON columns that have a JSON-family index (computed once at construction).
  • MutableSegmentImpl — feeds the cached Map to the index, gated on supportsParsedValue().
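
As a rough illustration of "flatten the parsed value directly", the sketch below walks an already-parsed Map/List tree and emits one entry per leaf, with no string tokenization step. It is a simplified stand-in only: the class name `ParsedFlattenSketch` and the `$.a[0]` path syntax are hypothetical, and it does not reproduce the record format that JsonUtils.flattenParsed actually produces.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified flattener over an already-parsed value.
// Assumption: the path syntax here is illustrative, not Pinot's flattened-record format.
final class ParsedFlattenSketch {
  private ParsedFlattenSketch() {
  }

  /** Walks a parsed Map/List tree and records one "path -> leaf" entry per leaf. */
  static Map<String, String> flatten(Object parsed) {
    Map<String, String> out = new LinkedHashMap<>();
    walk("$", parsed, out);
    return out;
  }

  private static void walk(String path, Object node, Map<String, String> out) {
    if (node instanceof Map) {
      // Object: descend into each field, extending the path with the key.
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        walk(path + "." + e.getKey(), e.getValue(), out);
      }
    } else if (node instanceof List) {
      // Array: descend into each element, extending the path with the index.
      List<?> list = (List<?>) node;
      for (int i = 0; i < list.size(); i++) {
        walk(path + "[" + i + "]", list.get(i), out);
      }
    } else {
      // Leaf: store its string form (null becomes "null").
      out.put(path, String.valueOf(node));
    }
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("level", "ERROR");
    doc.put("tags", List.of("a", "b"));
    System.out.println(flatten(doc));
  }
}
```

The cost of this walk depends on the number of nodes in the tree rather than on the length of the serialized text, which is the property the parsed path relies on.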

Behavior-preserving

  • flattenParsed produces records identical to the old serialize+reparse path. For non-JSON-native leaf types (e.g. a BigDecimal/Float placed on a JSON column by a non-JSON RecordReader) valueToTree would not round-trip identically, so it falls back to serialize+reparse — byte-identical to today. Verified by JsonUtilsTest#testFlattenParsedValueMatchesString (diverse leaf types) and JsonIndexTest#testMutableJsonIndexParsedMatchesString (identical getMatchingDocIds via both paths).
  • supportsParsedValue() (default false) gates feeding the parsed value: an index that doesn't override addParsed keeps getting the already-serialized string, so there's no extra serialize / regression. MutableJsonIndexImpl opts in.
  • Detection covers any JSON-family index (the json index, or a plugin index whose id contains "json"); those benefit once their mutable index overrides addParsed + supportsParsedValue.
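
The gating described above can be sketched roughly as follows. All names here (`SketchJsonIndex`, `StringOnlyIndex`, `ParsedAwareIndex`, `GatedFeedSketch`) are hypothetical, and the default `addParsed` body uses `String.valueOf` purely as a placeholder for real JSON serialization; the sketch shows only the opt-in shape, not Pinot's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface mirroring the opt-in shape described in the PR text.
interface SketchJsonIndex {
  void add(String json);

  // Indexes that have not opted in report false and keep receiving strings.
  default boolean supportsParsedValue() {
    return false;
  }

  // Assumption: String.valueOf is a placeholder for a real JSON serializer.
  default void addParsed(Object parsed) {
    add(String.valueOf(parsed));
  }
}

// An index that does not override the parsed path.
final class StringOnlyIndex implements SketchJsonIndex {
  final List<String> received = new ArrayList<>();

  @Override
  public void add(String json) {
    received.add("string:" + json);
  }
}

// An index that opts in to the parsed path.
final class ParsedAwareIndex implements SketchJsonIndex {
  final List<String> received = new ArrayList<>();

  @Override
  public void add(String json) {
    received.add("string:" + json);
  }

  @Override
  public boolean supportsParsedValue() {
    return true;
  }

  @Override
  public void addParsed(Object parsed) {
    received.add("parsed:" + parsed);
  }
}

final class GatedFeedSketch {
  private GatedFeedSketch() {
  }

  // Feed the cached parsed value only when one exists and the index opted in;
  // otherwise hand over the string that was already serialized for the forward index.
  static void feed(SketchJsonIndex index, String serialized, Object cachedParsed) {
    if (cachedParsed != null && index.supportsParsedValue()) {
      index.addParsed(cachedParsed);
    } else {
      index.add(serialized);
    }
  }

  public static void main(String[] args) {
    StringOnlyIndex legacy = new StringOnlyIndex();
    ParsedAwareIndex optedIn = new ParsedAwareIndex();
    Map<String, Object> parsed = Map.of("k", 1);
    feed(legacy, "{\"k\":1}", parsed);
    feed(optedIn, "{\"k\":1}", parsed);
    System.out.println(legacy.received + " " + optedIn.received);
  }
}
```

Because the caller checks the capability before choosing a path, an index that never opts in sees exactly the input it saw before, with no additional serialization.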

Performance (BenchmarkJsonFlatten / real MutableJsonIndexImpl)

| message | re-parse path (today) | parsed path (this PR) | gain |
| --- | --- | --- | --- |
| ~330 B | 460 docs/ms | 508 docs/ms | ~1.0x |
| ~2.9 KB | 152 docs/ms | 515 docs/ms | 3.4x |
| ~8 KB | 56 docs/ms | 400 docs/ms | 6-7x |

addParsed(Map) throughput is roughly constant across these sizes; add(String) degrades with size because it re-tokenizes the document. So big-document (log) tables gain 3-7x, while tiny JSON is essentially unchanged (which is why the cache is gated on having a JSON index).

SPI surface

New default methods on MutableJsonIndex (addParsed, supportsParsedValue), new public GenericRow accessors, new JsonUtils.flattenParsed overload — all additive (source/binary compatible). Implementers of MutableJsonIndex should note the add(Object,…) default dispatch now routes Map/List to addParsed (other types still fail fast).
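
A minimal sketch of that default dispatch, for implementers who want to see the routing in isolation. The interface and class names (`DispatchSketchIndex`, `RecordingDispatchIndex`) are hypothetical, the default `addParsed` body is a placeholder assumption, and the real method's exception handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface showing only the routing of the three-argument add.
interface DispatchSketchIndex {
  void add(String json);

  // Assumption: String.valueOf is a placeholder for real JSON serialization.
  default void addParsed(Object parsed) {
    add(String.valueOf(parsed));
  }

  default void add(Object value, int dictId, int docId) {
    if (value instanceof Map || value instanceof List) {
      // Already-parsed value: route to the parsed path.
      addParsed(value);
    } else {
      // String is the common case; any other type fails fast with ClassCastException.
      add((String) value);
    }
  }
}

// Records which path each value took.
final class RecordingDispatchIndex implements DispatchSketchIndex {
  final List<String> calls = new ArrayList<>();

  @Override
  public void add(String json) {
    calls.add("string");
  }

  @Override
  public void addParsed(Object parsed) {
    calls.add("parsed");
  }

  public static void main(String[] args) {
    RecordingDispatchIndex index = new RecordingDispatchIndex();
    index.add(Map.of("a", 1), -1, 0);
    index.add("{\"a\":1}", -1, 1);
    System.out.println(index.calls);
  }
}
```

An implementer that overrides only add(String) still receives Map and List values through the default addParsed, so it is worth checking what that default does before relying on it.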

Copilot AI left a comment

Contributor

Pull request overview

This PR optimizes realtime ingestion for JSON columns with JSON-family indexes by avoiding a per-row JSON serialize → re-parse round-trip, reusing the already-parsed representation and flattening it directly for indexing.

Changes:

  • Add JsonUtils.flattenParsed(...) and a native-type check to flatten parsed JSON values (Map/List/JsonNode) without string tokenization, with a fallback to the existing string path for non-native leaf types.
  • Introduce a transient per-row parsed JSON cache on GenericRow, populate it in DataTypeTransformer for JSON-indexed JSON columns, and feed it to JSON mutable indexes when supportsParsedValue() is enabled.
  • Extend MutableJsonIndex with additive SPI defaults (addParsed, supportsParsedValue) and implement the parsed path in MutableJsonIndexImpl, with accompanying unit tests and a JMH benchmark.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java | Adds flattenParsed and native-type detection to avoid re-tokenizing JSON strings during indexing. |
| pinot-spi/src/test/java/org/apache/pinot/spi/utils/JsonUtilsTest.java | Adds coverage ensuring flattenParsed matches the legacy serialize+reparse behavior. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/GenericRow.java | Adds transient per-row parsed JSON cache accessors and clears it on clear(). |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableJsonIndex.java | Adds additive SPI hooks for parsed JSON ingestion and capability gating. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java | Computes which JSON columns should cache parsed values and populates the cache during transform. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java | Feeds cached parsed JSON to mutable JSON indexes when supported. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java | Implements addParsed + supportsParsedValue using flattenParsed. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/JsonIndexTest.java | Adds test asserting parsed-vs-string indexing produces identical matches. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformerTest.java | Adds test verifying caching is enabled only when JSON index is configured. |
| pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonFlatten.java | Adds a JMH benchmark comparing flatten-from-string vs flatten-from-parsed-map costs. |

Comment on lines +119 to +120
      if (!_jsonCacheColumns.isEmpty() && _jsonCacheColumns.contains(column)
          && (value instanceof Map || value instanceof List)) {

Comment on lines 33 to 42
  default void add(Object value, int dictId, int docId) {
    try {
-     if (value instanceof Map) {
-       add(JsonUtils.objectToString(value));
+     if (value instanceof Map || value instanceof List) {
+       // Already-parsed JSON value (e.g. a Map cached on the GenericRow before it was serialized for the forward
+       // index): flatten it directly, avoiding the serialize-then-reparse round-trip.
+       addParsed(value);
      } else {
+       // String (the common case) or, for any other unexpected type, fail fast with a ClassCastException as before.
        add((String) value);
      }

@codecov-commenter

codecov-commenter commented Jun 14, 2026

Codecov Report

❌ Patch coverage is 81.89655% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.78%. Comparing base (5617ee7) to head (8fcae8f).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ain/java/org/apache/pinot/spi/utils/JsonUtils.java | 81.81% | 6 Missing and 2 partials ⚠️ |
| ...t/local/recordtransformer/DataTypeTransformer.java | 87.17% | 3 Missing and 2 partials ⚠️ |
| ...ot/segment/spi/index/mutable/MutableJsonIndex.java | 0.00% | 5 Missing ⚠️ |
| ...local/indexsegment/mutable/MutableSegmentImpl.java | 50.00% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18756      +/-   ##
============================================
+ Coverage     64.76%   64.78%   +0.01%     
  Complexity     1309     1309              
============================================
  Files          3380     3380              
  Lines        209573   209656      +83     
  Branches      32805    32823      +18     
============================================
+ Hits         135735   135828      +93     
+ Misses        62914    62905       -9     
+ Partials      10924    10923       -1     

| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.78% <81.89%> (+0.01%) ⬆️ |
| temurin | 64.78% <81.89%> (+0.01%) ⬆️ |
| unittests | 64.78% <81.89%> (+0.01%) ⬆️ |
| unittests1 | 56.96% <73.27%> (+<0.01%) ⬆️ |
| unittests2 | 37.28% <69.82%> (+0.02%) ⬆️ |

@xiangfu0 xiangfu0 force-pushed the st-json-index-cache branch 2 times, most recently from fd6f6b2 to 577accd Compare June 14, 2026 22:30
…n it directly

A JSON-typed column with a JSON index is parsed twice per row: the value (JSON
text in the common case) is parsed and re-serialized to a canonical string for
the forward index, and the JSON index then re-parses that string. For large
documents (e.g. log messages with stack traces) the re-tokenization dominates.

DataTypeTransformer now parses the JSON once: it caches the parsed value on the
GenericRow and reuses its canonical string for the forward index, so the JSON
index flattens the cached value via the new MutableJsonIndex.addParsed(Object) /
JsonUtils.flattenParsed(Object) instead of re-parsing. This covers JSON columns
fed as text (parse the string once, cache the JsonNode), as Maps/Lists, or as
JsonNodes.

Behavior-preserving (the flattened records are byte-for-byte identical to the
string path): flattenParsed re-parses each DecimalNode leaf the way the string
path does (so an integral decimal like 2.0 -> "2", scientific notation, and
values past 2^53 stay identical) and falls back to serialize+reparse only for
Float / byte[] leaves from a non-JSON RecordReader. Gated by supportsParsedValue()
so an index that does not optimize the parsed path keeps getting the serialized
string.

The cache self-invalidates: GenericRow.putValue/putValues/removeValue/
putDefaultNullValue drop the parsed entry, so a transformer that rewrites the
value after DataTypeTransformer (e.g. SanitizationTransformer trimming an
over-length JSON string) cannot leave a stale parsed node for the index.
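
The self-invalidation rule can be sketched as below. This is an illustrative model only: `RowWithParsedCacheSketch` and its method names `putParsed` / `getParsed` are hypothetical and are not GenericRow's actual accessors.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical row model: every write to a column drops that column's parsed entry.
final class RowWithParsedCacheSketch {
  private final Map<String, Object> values = new HashMap<>();
  // Transient side cache; deliberately not part of the row's value state.
  private final Map<String, Object> parsedCache = new HashMap<>();

  void putValue(String column, Object value) {
    values.put(column, value);
    // Any rewrite of the value invalidates what was parsed from the old value.
    parsedCache.remove(column);
  }

  Object removeValue(String column) {
    parsedCache.remove(column);
    return values.remove(column);
  }

  Object getValue(String column) {
    return values.get(column);
  }

  // Must be called after putValue for the same column, or the entry is dropped.
  void putParsed(String column, Object parsed) {
    parsedCache.put(column, parsed);
  }

  Object getParsed(String column) {
    return parsedCache.get(column);
  }

  void clear() {
    values.clear();
    parsedCache.clear();
  }

  public static void main(String[] args) {
    RowWithParsedCacheSketch row = new RowWithParsedCacheSketch();
    row.putValue("msg", "{\"a\":1}");
    row.putParsed("msg", Map.of("a", 1));
    // A later transformer rewrites the value; the parsed entry goes away with it.
    row.putValue("msg", "{}");
    System.out.println(row.getParsed("msg"));
  }
}
```

Tying invalidation to the write methods means a downstream transformer needs no knowledge of the cache: the index simply finds no parsed entry and falls back to the string value.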

SPI surface (all additive, source/binary compatible): MutableJsonIndex.addParsed
+ supportsParsedValue default methods, GenericRow parsed-value cache accessors,
JsonUtils.flattenParsed overload.

~3-7x JSON-index ingestion throughput on multi-KB documents (BenchmarkJsonFlatten);
~no change for tiny JSON.
@xiangfu0 xiangfu0 force-pushed the st-json-index-cache branch from 577accd to 8fcae8f Compare June 15, 2026 09:49
3 participants