Parse-once for JSON-index ingestion: cache the parsed Map and flatten it directly by xiangfu0 · Pull Request #18756 · apache/pinot

xiangfu0 · 2026-06-14T07:51:09Z

What

A dataType: JSON column with a JSON index pays a serialize → re-parse round-trip per row: DataTypeTransformer serializes the parsed Map to a string for the forward index, and the mutable JSON index then re-parses that same string (stringToJsonNode → flatten). For large documents (log messages with stack traces) the re-tokenization dominates ingestion CPU.

This caches the already-parsed Map on the GenericRow and feeds it to the JSON index, which flattens it directly:

JsonUtils.flattenParsed(Object) — flatten a parsed Map/List/JsonNode via valueToTree, skipping string tokenization.
MutableJsonIndex.addParsed(Object) + MutableJsonIndexImpl override — index the parsed value directly.
GenericRow — transient per-row parsed-value cache (cleared per row; not part of value/equality/copy/serialized state).
DataTypeTransformer — caches the Map only for JSON columns that have a JSON-family index (computed once at construction).
MutableSegmentImpl — feeds the cached Map to the index, gated on supportsParsedValue().
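The caching step listed above could be sketched roughly as follows. This is a simplified, self-contained illustration rather than the PR's code: `RowSketch`, `ParseOnceTransformerSketch`, and the toy `toJsonString` serializer are hypothetical stand-ins for GenericRow, DataTypeTransformer, and the real JSON serialization, and they assume values arrive as Maps or Lists.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for GenericRow: column values plus a transient parsed-value cache.
class RowSketch {
  final Map<String, Object> values = new HashMap<>();
  final Map<String, Object> parsedCache = new HashMap<>();
}

// Hypothetical stand-in for DataTypeTransformer.
class ParseOnceTransformerSketch {
  // Computed once at construction: JSON columns that also have a JSON-family index.
  private final Set<String> jsonCacheColumns;

  ParseOnceTransformerSketch(Set<String> jsonColumnsWithJsonIndex) {
    this.jsonCacheColumns = Set.copyOf(jsonColumnsWithJsonIndex);
  }

  void transform(RowSketch row) {
    for (Map.Entry<String, Object> entry : row.values.entrySet()) {
      Object value = entry.getValue();
      if (value instanceof Map || value instanceof List) {
        // Keep the parsed value only where an index can reuse it.
        if (jsonCacheColumns.contains(entry.getKey())) {
          row.parsedCache.put(entry.getKey(), value);
        }
        // The forward index still receives the serialized string.
        entry.setValue(toJsonString(value));
      }
    }
  }

  // Toy serializer for the sketch only: no escaping, no special number handling.
  static String toJsonString(Object value) {
    if (value instanceof Map) {
      StringBuilder sb = new StringBuilder("{");
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        if (sb.length() > 1) {
          sb.append(',');
        }
        sb.append('"').append(e.getKey()).append("\":").append(toJsonString(e.getValue()));
      }
      return sb.append('}').toString();
    }
    if (value instanceof List) {
      StringBuilder sb = new StringBuilder("[");
      for (Object o : (List<?>) value) {
        if (sb.length() > 1) {
          sb.append(',');
        }
        sb.append(toJsonString(o));
      }
      return sb.append(']').toString();
    }
    if (value instanceof String) {
      return "\"" + value + "\"";
    }
    return String.valueOf(value);
  }
}
```

In this sketch a column without a JSON index is still serialized for the forward index, but nothing is cached for it, so rows on such tables carry no extra state.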

Behavior-preserving

flattenParsed produces records identical to the old serialize+reparse path. For non-JSON-native leaf types (e.g. a BigDecimal/Float placed on a JSON column by a non-JSON RecordReader) valueToTree would not round-trip identically, so flattenParsed falls back to serialize+reparse for those values, which is byte-identical to today's behavior. Verified by JsonUtilsTest#testFlattenParsedValueMatchesString (diverse leaf types) and JsonIndexTest#testMutableJsonIndexParsedMatchesString (identical getMatchingDocIds via both paths).
supportsParsedValue() (default false) gates feeding the parsed value: an index that doesn't override addParsed keeps getting the already-serialized string, so there's no extra serialize / regression. MutableJsonIndexImpl opts in.
Detection covers any JSON-family index (the json index, or a plugin index whose id contains "json"); those benefit once their mutable index overrides addParsed + supportsParsedValue.
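The fallback decision described above might look something like the sketch below. It is not the PR's native-type check: the class and method names are hypothetical, and the set of leaf types treated as "native" here (String, Boolean, Integer, Long, Double, null) is an assumption made for illustration. Only BigDecimal and Float are named by the description as non-native examples.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a native-leaf check with a fallback decision.
class NativeLeafCheckSketch {
  // Returns true when every leaf is one of the types this sketch assumes to be JSON-native.
  static boolean isJsonNative(Object value) {
    if (value == null || value instanceof String || value instanceof Boolean
        || value instanceof Integer || value instanceof Long || value instanceof Double) {
      return true;
    }
    if (value instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        if (!(e.getKey() instanceof String) || !isJsonNative(e.getValue())) {
          return false;
        }
      }
      return true;
    }
    if (value instanceof List) {
      for (Object element : (List<?>) value) {
        if (!isJsonNative(element)) {
          return false;
        }
      }
      return true;
    }
    // Anything else (e.g. BigDecimal, Float) is treated as non-native.
    return false;
  }

  // Picks the direct flatten path only when the whole value is native.
  static String choosePath(Object value) {
    return isJsonNative(value) ? "flatten-parsed" : "serialize+reparse";
  }

  public static void main(String[] args) {
    System.out.println(choosePath(Map.of("level", "ERROR", "codes", List.of(1, 2))));
    System.out.println(choosePath(Map.of("price", new BigDecimal("2.0"))));
  }
}
```

A single non-native leaf anywhere in the document sends the whole value down the string path, which keeps the two paths from ever producing different records for the same row.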

Performance (`BenchmarkJsonFlatten` / real `MutableJsonIndexImpl`)

Message size	Re-parse path (today)	Parsed path (this PR)	Gain
~330 B	460 docs/ms	508 docs/ms	~1.0x
~2.9 KB	152 docs/ms	515 docs/ms	3.4x
~8 KB	56 docs/ms	400 docs/ms	6-7x

addParsed(Map) throughput is roughly constant across document sizes, while add(String) degrades with size because it re-tokenizes the document. So big-document (log) tables win 3-7x; tiny JSON is unchanged (which is why the cache is gated on having a JSON index).
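As a quick check of the gain column, the ratios can be recomputed from the throughput figures in the table. The class below is only an arithmetic aid with hypothetical names; the numbers are copied from the table, not re-measured.

```java
import java.util.Locale;

// Recomputes gain = parsed-path throughput / re-parse-path throughput from the table above.
class GainArithmeticSketch {
  static double gain(double reparseDocsPerMs, double parsedDocsPerMs) {
    return parsedDocsPerMs / reparseDocsPerMs;
  }

  public static void main(String[] args) {
    String[] sizes = {"~330 B", "~2.9 KB", "~8 KB"};
    // Each row: {re-parse path docs/ms, parsed path docs/ms}.
    double[][] docsPerMs = {{460, 508}, {152, 515}, {56, 400}};
    for (int i = 0; i < sizes.length; i++) {
      System.out.println(
          String.format(Locale.ROOT, "%s: %.2fx", sizes[i], gain(docsPerMs[i][0], docsPerMs[i][1])));
    }
  }
}
```

Running it prints 1.10x, 3.39x, and 7.14x for the three rows, which is in line with the rounded gains reported in the table.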

SPI surface

New default methods on MutableJsonIndex (addParsed, supportsParsedValue), new public GenericRow accessors, new JsonUtils.flattenParsed overload — all additive (source/binary compatible). Implementers of MutableJsonIndex should note the add(Object,…) default dispatch now routes Map/List to addParsed (other types still fail fast).
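For implementers, the additive defaults and the capability gate could be pictured as below. This is a simplified sketch under stated assumptions, not the real SPI: `JsonIndexSketch` and the other names are hypothetical, the real methods take more parameters, and the body of the default `addParsed` (serialize, then take the string path) is assumed here, with `String.valueOf` standing in for actual JSON serialization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for MutableJsonIndex.
interface JsonIndexSketch {
  void add(String jsonString);

  // Assumed default for this sketch: serialize and take the string path.
  // String.valueOf is a placeholder, not real JSON serialization.
  default void addParsed(Object parsedValue) {
    add(String.valueOf(parsedValue));
  }

  // Indexes opt in to receiving parsed values by overriding this.
  default boolean supportsParsedValue() {
    return false;
  }

  // Default dispatch: Map/List go to addParsed, everything else must be a String.
  default void add(Object value) {
    if (value instanceof Map || value instanceof List) {
      addParsed(value);
    } else {
      add((String) value); // unexpected types fail fast with ClassCastException
    }
  }
}

// An index that only knows the string path.
class LegacyIndexSketch implements JsonIndexSketch {
  final List<String> calls = new ArrayList<>();

  @Override
  public void add(String jsonString) {
    calls.add("string:" + jsonString);
  }
}

// An index that opts in to the parsed path.
class ParsedAwareIndexSketch extends LegacyIndexSketch {
  @Override
  public void addParsed(Object parsedValue) {
    calls.add("parsed:" + parsedValue);
  }

  @Override
  public boolean supportsParsedValue() {
    return true;
  }
}

// Hypothetical stand-in for the feeding logic in the mutable segment.
class SegmentFeedSketch {
  static void feed(JsonIndexSketch index, String serialized, Object cachedParsed) {
    if (cachedParsed != null && index.supportsParsedValue()) {
      index.addParsed(cachedParsed);
    } else {
      index.add(serialized); // no extra serialization: the string already exists
    }
  }
}
```

With this shape, an index that never overrides the new methods keeps receiving the already-serialized string from the feeder, so opting in is the only way behavior changes.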

Copilot

Pull request overview

This PR optimizes realtime ingestion for JSON columns with JSON-family indexes by avoiding a per-row JSON serialize → re-parse round-trip, reusing the already-parsed representation and flattening it directly for indexing.

Changes:

Add JsonUtils.flattenParsed(...) and a native-type check to flatten parsed JSON values (Map/List/JsonNode) without string tokenization, with a fallback to the existing string path for non-native leaf types.
Introduce a transient per-row parsed JSON cache on GenericRow, populate it in DataTypeTransformer for JSON-indexed JSON columns, and feed it to JSON mutable indexes when supportsParsedValue() is enabled.
Extend MutableJsonIndex with additive SPI defaults (addParsed, supportsParsedValue) and implement the parsed path in MutableJsonIndexImpl, with accompanying unit tests and a JMH benchmark.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java	Adds `flattenParsed` and native-type detection to avoid re-tokenizing JSON strings during indexing.
pinot-spi/src/test/java/org/apache/pinot/spi/utils/JsonUtilsTest.java	Adds coverage ensuring `flattenParsed` matches the legacy serialize+reparse behavior.
pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/GenericRow.java	Adds transient per-row parsed JSON cache accessors and clears it on `clear()`.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableJsonIndex.java	Adds additive SPI hooks for parsed JSON ingestion and capability gating.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java	Computes which JSON columns should cache parsed values and populates the cache during transform.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java	Feeds cached parsed JSON to mutable JSON indexes when supported.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java	Implements `addParsed` + `supportsParsedValue` using `flattenParsed`.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/JsonIndexTest.java	Adds test asserting parsed-vs-string indexing produces identical matches.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformerTest.java	Adds test verifying caching is enabled only when JSON index is configured.
pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonFlatten.java	Adds a JMH benchmark comparing flatten-from-string vs flatten-from-parsed-map costs.

+        if (!_jsonCacheColumns.isEmpty() && _jsonCacheColumns.contains(column)
+            && (value instanceof Map || value instanceof List)) {


  default void add(Object value, int dictId, int docId) {
    try {
-      if (value instanceof Map) {
-        add(JsonUtils.objectToString(value));
+      if (value instanceof Map || value instanceof List) {
+        // Already-parsed JSON value (e.g. a Map cached on the GenericRow before it was serialized for the forward
+        // index): flatten it directly, avoiding the serialize-then-reparse round-trip.
+        addParsed(value);
      } else {
+        // String (the common case) or, for any other unexpected type, fail fast with a ClassCastException as before.
        add((String) value);
      }


codecov-commenter · 2026-06-14T08:44:22Z

Codecov Report

❌ Patch coverage is 81.89655% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.78%. Comparing base (5617ee7) to head (8fcae8f).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
...ain/java/org/apache/pinot/spi/utils/JsonUtils.java	81.81%	6 Missing and 2 partials ⚠️
...t/local/recordtransformer/DataTypeTransformer.java	87.17%	3 Missing and 2 partials ⚠️
...ot/segment/spi/index/mutable/MutableJsonIndex.java	0.00%	5 Missing ⚠️
...local/indexsegment/mutable/MutableSegmentImpl.java	50.00%	1 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18756      +/-   ##
============================================
+ Coverage     64.76%   64.78%   +0.01%     
  Complexity     1309     1309              
============================================
  Files          3380     3380              
  Lines        209573   209656      +83     
  Branches      32805    32823      +18     
============================================
+ Hits         135735   135828      +93     
+ Misses        62914    62905       -9     
+ Partials      10924    10923       -1

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.78% <81.89%> (+0.01%)`	⬆️
temurin	`64.78% <81.89%> (+0.01%)`	⬆️
unittests	`64.78% <81.89%> (+0.01%)`	⬆️
unittests1	`56.96% <73.27%> (+<0.01%)`	⬆️
unittests2	`37.28% <69.82%> (+0.02%)`	⬆️


…n it directly

A JSON-typed column with a JSON index is parsed twice per row: the value (JSON text in the common case) is parsed and re-serialized to a canonical string for the forward index, and the JSON index then re-parses that string. For large documents (e.g. log messages with stack traces) the re-tokenization dominates.

DataTypeTransformer now parses the JSON once: it caches the parsed value on the GenericRow and reuses its canonical string for the forward index, so the JSON index flattens the cached value via the new MutableJsonIndex.addParsed(Object) / JsonUtils.flattenParsed(Object) instead of re-parsing. This covers JSON columns fed as text (parse the string once, cache the JsonNode), as Maps/Lists, or as JsonNodes.

Behavior-preserving (the flattened records are byte-for-byte identical to the string path): flattenParsed re-parses each DecimalNode leaf the way the string path does (so an integral decimal like 2.0 -> "2", scientific notation, and values past 2^53 stay identical) and falls back to serialize+reparse only for Float / byte[] leaves from a non-JSON RecordReader. Gated by supportsParsedValue() so an index that does not optimize the parsed path keeps getting the serialized string.

The cache self-invalidates: GenericRow.putValue/putValues/removeValue/putDefaultNullValue drop the parsed entry, so a transformer that rewrites the value after DataTypeTransformer (e.g. SanitizationTransformer trimming an over-length JSON string) cannot leave a stale parsed node for the index.

SPI surface (all additive, source/binary compatible): MutableJsonIndex.addParsed + supportsParsedValue default methods, GenericRow parsed-value cache accessors, JsonUtils.flattenParsed overload.

~3-7x JSON-index ingestion throughput on multi-KB documents (BenchmarkJsonFlatten); ~no change for tiny JSON.
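The self-invalidation described in the commit message could be sketched like this. The class is a hypothetical stand-in for GenericRow, and the parsed-value accessor names (`putParsedValue`, `getParsedValue`) are invented for the illustration, since the actual accessor names are not given here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for GenericRow with a self-invalidating parsed-value cache.
class SelfInvalidatingRowSketch {
  private final Map<String, Object> values = new HashMap<>();
  // Transient cache: not part of the row's value, equality, or serialized state.
  private final Map<String, Object> parsed = new HashMap<>();

  void putValue(String column, Object value) {
    values.put(column, value);
    parsed.remove(column); // any rewrite of the value drops the cached parsed entry
  }

  Object removeValue(String column) {
    parsed.remove(column);
    return values.remove(column);
  }

  Object getValue(String column) {
    return values.get(column);
  }

  // Called after the value has been set, so the cache entry survives until the next rewrite.
  void putParsedValue(String column, Object parsedValue) {
    parsed.put(column, parsedValue);
  }

  Object getParsedValue(String column) {
    return parsed.get(column);
  }

  // Clearing the row for reuse also clears the cache.
  void clear() {
    values.clear();
    parsed.clear();
  }

  public static void main(String[] args) {
    SelfInvalidatingRowSketch row = new SelfInvalidatingRowSketch();
    row.putValue("payload", "{\"msg\":\"a long message\"}");
    row.putParsedValue("payload", Map.of("msg", "a long message"));
    // A later transformer trims the string; the stale parsed value must not survive.
    row.putValue("payload", "{\"msg\":\"a lon");
    System.out.println(row.getParsedValue("payload") == null);
  }
}
```

Because the invalidation lives in the mutators themselves, a later transformer does not need to know the cache exists in order to keep the index and the forward index consistent.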

xiangfu0 requested review from Jackie-Jiang, Copilot and raghavyadav01 June 14, 2026 07:52

Copilot started reviewing on behalf of xiangfu0 June 14, 2026 07:52

Copilot AI reviewed Jun 14, 2026

xiangfu0 force-pushed the st-json-index-cache branch 2 times, most recently from fd6f6b2 to 577accd June 14, 2026 22:30

xiangfu0 force-pushed the st-json-index-cache branch from 577accd to 8fcae8f June 15, 2026 09:49

xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:st-json-index-cache

3 participants
