Parse-once for JSON-index ingestion: cache the parsed Map and flatten it directly#18756
Open
xiangfu0 wants to merge 1 commit into
Open
Parse-once for JSON-index ingestion: cache the parsed Map and flatten it directly#18756xiangfu0 wants to merge 1 commit into
xiangfu0 wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR optimizes realtime ingestion for JSON columns with JSON-family indexes by avoiding a per-row JSON serialize → re-parse round-trip, reusing the already-parsed representation and flattening it directly for indexing.
Changes:
- Add
JsonUtils.flattenParsed(...)and a native-type check to flatten parsed JSON values (Map/List/JsonNode) without string tokenization, with a fallback to the existing string path for non-native leaf types. - Introduce a transient per-row parsed JSON cache on
GenericRow, populate it inDataTypeTransformerfor JSON-indexed JSON columns, and feed it to JSON mutable indexes whensupportsParsedValue()is enabled. - Extend
MutableJsonIndexwith additive SPI defaults (addParsed,supportsParsedValue) and implement the parsed path inMutableJsonIndexImpl, with accompanying unit tests and a JMH benchmark.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java | Adds flattenParsed and native-type detection to avoid re-tokenizing JSON strings during indexing. |
| pinot-spi/src/test/java/org/apache/pinot/spi/utils/JsonUtilsTest.java | Adds coverage ensuring flattenParsed matches the legacy serialize+reparse behavior. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/GenericRow.java | Adds transient per-row parsed JSON cache accessors and clears it on clear(). |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableJsonIndex.java | Adds additive SPI hooks for parsed JSON ingestion and capability gating. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java | Computes which JSON columns should cache parsed values and populates the cache during transform. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java | Feeds cached parsed JSON to mutable JSON indexes when supported. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java | Implements addParsed + supportsParsedValue using flattenParsed. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/JsonIndexTest.java | Adds test asserting parsed-vs-string indexing produces identical matches. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformerTest.java | Adds test verifying caching is enabled only when JSON index is configured. |
| pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonFlatten.java | Adds a JMH benchmark comparing flatten-from-string vs flatten-from-parsed-map costs. |
Comment on lines
+119
to
+120
| if (!_jsonCacheColumns.isEmpty() && _jsonCacheColumns.contains(column) | ||
| && (value instanceof Map || value instanceof List)) { |
Comment on lines
33
to
42
| default void add(Object value, int dictId, int docId) { | ||
| try { | ||
| if (value instanceof Map) { | ||
| add(JsonUtils.objectToString(value)); | ||
| if (value instanceof Map || value instanceof List) { | ||
| // Already-parsed JSON value (e.g. a Map cached on the GenericRow before it was serialized for the forward | ||
| // index): flatten it directly, avoiding the serialize-then-reparse round-trip. | ||
| addParsed(value); | ||
| } else { | ||
| // String (the common case) or, for any other unexpected type, fail fast with a ClassCastException as before. | ||
| add((String) value); | ||
| } |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18756 +/- ##
============================================
+ Coverage 64.76% 64.78% +0.01%
Complexity 1309 1309
============================================
Files 3380 3380
Lines 209573 209656 +83
Branches 32805 32823 +18
============================================
+ Hits 135735 135828 +93
+ Misses 62914 62905 -9
+ Partials 10924 10923 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
fd6f6b2 to
577accd
Compare
…n it directly A JSON-typed column with a JSON index is parsed twice per row: the value (JSON text in the common case) is parsed and re-serialized to a canonical string for the forward index, and the JSON index then re-parses that string. For large documents (e.g. log messages with stack traces) the re-tokenization dominates. DataTypeTransformer now parses the JSON once: it caches the parsed value on the GenericRow and reuses its canonical string for the forward index, so the JSON index flattens the cached value via the new MutableJsonIndex.addParsed(Object) / JsonUtils.flattenParsed(Object) instead of re-parsing. This covers JSON columns fed as text (parse the string once, cache the JsonNode), as Maps/Lists, or as JsonNodes. Behavior-preserving (the flattened records are byte-for-byte identical to the string path): flattenParsed re-parses each DecimalNode leaf the way the string path does (so an integral decimal like 2.0 -> "2", scientific notation, and values past 2^53 stay identical) and falls back to serialize+reparse only for Float / byte[] leaves from a non-JSON RecordReader. Gated by supportsParsedValue() so an index that does not optimize the parsed path keeps getting the serialized string. The cache self-invalidates: GenericRow.putValue/putValues/removeValue/ putDefaultNullValue drop the parsed entry, so a transformer that rewrites the value after DataTypeTransformer (e.g. SanitizationTransformer trimming an over-length JSON string) cannot leave a stale parsed node for the index. SPI surface (all additive, source/binary compatible): MutableJsonIndex.addParsed + supportsParsedValue default methods, GenericRow parsed-value cache accessors, JsonUtils.flattenParsed overload. ~3-7x JSON-index ingestion throughput on multi-KB documents (BenchmarkJsonFlatten); ~no change for tiny JSON.
577accd to
8fcae8f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A
dataType: JSONcolumn with a JSON index pays a serialize → re-parse round-trip per row:DataTypeTransformerserializes the parsedMapto a string for the forward index, and the mutable JSON index then re-parses that same string (stringToJsonNode→ flatten). For large documents (log messages with stack traces) the re-tokenization dominates ingestion CPU.This caches the already-parsed
Mapon theGenericRowand feeds it to the JSON index, which flattens it directly:JsonUtils.flattenParsed(Object)— flatten a parsedMap/List/JsonNodeviavalueToTree, skipping string tokenization.MutableJsonIndex.addParsed(Object)+MutableJsonIndexImploverride — index the parsed value directly.GenericRow— transient per-row parsed-value cache (cleared per row; not part of value/equality/copy/serialized state).DataTypeTransformer— caches theMaponly for JSON columns that have a JSON-family index (computed once at construction).MutableSegmentImpl— feeds the cachedMapto the index, gated onsupportsParsedValue().Behavior-preserving
flattenParsedproduces records identical to the old serialize+reparse path. For non-JSON-native leaf types (e.g. aBigDecimal/Floatplaced on a JSON column by a non-JSON RecordReader)valueToTreewould not round-trip identically, so it falls back to serialize+reparse — byte-identical to today. Verified byJsonUtilsTest#testFlattenParsedValueMatchesString(diverse leaf types) andJsonIndexTest#testMutableJsonIndexParsedMatchesString(identicalgetMatchingDocIdsvia both paths).supportsParsedValue()(defaultfalse) gates feeding the parsed value: an index that doesn't overrideaddParsedkeeps getting the already-serialized string, so there's no extra serialize / regression.MutableJsonIndexImplopts in.jsonindex, or a plugin index whose id contains "json"); those benefit once their mutable index overridesaddParsed+supportsParsedValue.Performance (
BenchmarkJsonFlatten/ realMutableJsonIndexImpl)addParsed(Map)is ~constant;add(String)degrades with size because it re-tokenizes the document. So big-document (log) tables win 3-7x; tiny JSON is unchanged (which is why the cache is gated on having a JSON index).SPI surface
New default methods on
MutableJsonIndex(addParsed,supportsParsedValue), new publicGenericRowaccessors, newJsonUtils.flattenParsedoverload — all additive (source/binary compatible). Implementers ofMutableJsonIndexshould note theadd(Object,…)default dispatch now routesMap/ListtoaddParsed(other types still fail fast).