Add dictionary-based TEXT index option (buildOnDictionary)#18758
xiangfu0 wants to merge 4 commits into
…lse)

Adds an opt-in buildOnDictionary flag for the Lucene TEXT index. When enabled on a dictionary-encoded STRING column, the Lucene index is built over the column dictionary (one document per distinct value, Lucene docId == dictId) instead of one document per row. This makes the index size scale with cardinality rather than row count, and TEXT_MATCH resolves the matching dictIds to docIds through the existing dictionary-based filter operators (inverted index / scan), mirroring the FST/REGEXP_LIKE path.

Details:
- Config: FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY + TextIndexConfig.isBuildOnDictionary() (default false), parsed in TextIndexConfigBuilder.
- Creation: TextIndexType.createIndexCreator feeds the sorted dictionary values to a dict-mode LuceneTextIndexCreator (per-row add() is a no-op; dict mode is gated to commit==true so consuming segments keep building per-row). Reload is handled by TextIndexHandler looping the dictionary, mirroring FSTIndexHandler.
- Self-describing segment: the build mode is persisted in lucene.properties; the reader derives its mode from the segment, not the live table config, so a table holding a mix of dictionary-based and per-row text segments reads each correctly. The table-config flag governs creation only.
- Reader: LuceneTextIndexReader serves getDictIds() in dictionary mode and uses a no-op docId translator (Lucene docId already equals dictId).
- Query: new TextMatchDictIdPredicateEvaluatorFactory; FilterPlanNode routes a dictionary-based TEXT_MATCH through FilterOperatorUtils. SV and MV supported. TEXT_MATCH options are not supported in dictionary mode (rejected).
- Validation: buildOnDictionary requires a dictionary-encoded column (hard failure at table-config validation and a creation-time precondition).

Tests: dictionary-based TEXT_MATCH equals the per-row index for SV/MV, OR and zero-match queries; reload builds the index from the dictionary; size is smaller for low cardinality; non-dictionary column and TEXT_MATCH options are rejected; TextIndexConfig round-trips the flag.
Codecov Report

Coverage diff, master vs. #18758:

| Metric | master | #18758 | +/- |
|---|---|---|---|
| Coverage | 64.59% | 64.80% | +0.21% |
| Complexity | 1305 | 1309 | +4 |
| Files | 3373 | 3381 | +8 |
| Lines | 208894 | 209633 | +739 |
| Branches | 32667 | 32816 | +149 |
| Hits | 134937 | 135857 | +920 |
| Misses | 63116 | 62838 | -278 |
| Partials | 10841 | 10938 | +97 |

Pull request overview
Adds an opt-in buildOnDictionary mode for Lucene TEXT indexes so that (for dictionary-encoded STRING columns) the Lucene index can be built per distinct dictionary value (docId == dictId) rather than per row, reducing index size for lower-cardinality text columns while preserving TEXT_MATCH semantics via dictId→docId resolution through existing dictionary-based filter operators.
Changes:
- Introduces `buildOnDictionary` config plumbed through `FieldConfig`/`TextIndexConfig`, parsing, equality/hash, and persistence to `lucene.properties`.
- Implements dictionary-mode creation/handling (segment creation + reload) and reader support (`getDictIds()`, no-op docId translator).
- Routes dictionary-based `TEXT_MATCH` through a new dict-id predicate evaluator + adds an end-to-end query test suite.

Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java | Adds TEXT_INDEX_BUILD_ON_DICTIONARY field property constant. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java | Adds buildOnDictionary to TEXT index config + builder + equality/hash. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/TextIndexReader.java | Adds isBuildOnDictionary() default contract for readers. |
| pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/TextIndexConfigTest.java | Adds JSON parsing/round-trip test for buildOnDictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java | Persists/recovers buildOnDictionary in Lucene properties. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexType.java | Validates/enforces dictionary mode and seeds creator from unique dictionary values. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexConfigBuilder.java | Parses buildOnDictionary from field properties. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java | Implements dictionary-mode reading (isBuildOnDictionary, getDictIds, no-op translator). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/TextIndexHandler.java | Builds dictionary-mode index on reload by iterating dictionary values. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java | Makes per-row add() callbacks no-ops in dictionary mode; disables reuse-mutable-index for dict mode. |
| pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java | Routes dictionary-mode TEXT_MATCH via dict-id predicate evaluator + dictionary-based filter operators. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/TextMatchDictIdPredicateEvaluatorFactory.java | New predicate evaluator factory returning matching dictIds bitmap. |
| pinot-core/src/test/java/org/apache/pinot/queries/DictionaryBasedTextIndexQueriesTest.java | New end-to-end tests comparing per-row vs dictionary-mode TEXT_MATCH and reload behavior. |
… absent)

When a text-index segment's persisted lucene.properties does not contain the buildOnDictionary key (segments written before this feature), the reader now defaults to false instead of falling back to the live table-config value. This preserves the segment-self-describes guarantee: a per-row segment can never be misread as dictionary-based after the flag is flipped on at the table level.

- TextIndexUtils.getUpdatedConfigFromPropertiesFile: default false when absent.
- LuceneTextIndexReader.updateConfigFromProperties (buffer path): default false when absent.
- Test: segmentWithoutPersistedFlagReadsAsPerRow strips the key from a per-row segment's properties, reloads under a buildOnDictionary=true table config, and asserts the segment is still read as per-row with correct TEXT_MATCH results.
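
The default-when-absent rule can be illustrated with a small sketch. This is not the PR's code: the class and method names are hypothetical, and the exact property key string is assumed to be `buildOnDictionary` based on the wording above. The point it shows is that the table-config value is never consulted when deciding how to read a segment.

```java
import java.util.Properties;

/** Illustrative sketch only; names here are hypothetical, not Pinot APIs. */
final class SegmentBuildModeSketch {
  // Assumed key name, taken from the commit message's wording.
  static final String BUILD_ON_DICTIONARY_KEY = "buildOnDictionary";

  /**
   * Decides the read mode from the segment's own properties.
   * The live table-config flag is accepted only to show that it is ignored:
   * a missing key always means "per-row".
   */
  static boolean isBuildOnDictionary(Properties segmentProperties, boolean tableConfigFlag) {
    String persisted = segmentProperties.getProperty(BUILD_ON_DICTIONARY_KEY, "false");
    return Boolean.parseBoolean(persisted);
  }

  public static void main(String[] args) {
    Properties legacySegment = new Properties(); // written before the feature existed
    System.out.println(isBuildOnDictionary(legacySegment, true));
  }
}
```

With this shape, flipping the table-level flag cannot change how an already-written segment is interpreted; only a rebuild that rewrites the properties can.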
The Pinot Binary Compatibility Check (japicmp) flagged the buildOnDictionary parameter added to the public TextIndexConfig @JsonCreator constructor as a removed constructor in the pinot-segment-spi SPI artifact. Restore the prior 19-arg public constructor (delegating to the canonical constructor with buildOnDictionary defaulted to false) so old compiled callers keep linking. Verified locally with japicmp 0.23.1 --error-on-binary-incompatibility: both pinot-spi and pinot-segment-spi report no binary-incompatible changes.
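
The compatibility fix follows a common pattern: keep the old constructor signature and delegate to the new canonical one with the added parameter defaulted. The sketch below shows that pattern on a deliberately tiny, hypothetical class (one pre-existing argument instead of the 19 mentioned above); it is not the real `TextIndexConfig`.

```java
import java.util.Objects;

/** Hypothetical stand-in for a config class that gained a new boolean flag. */
final class TextConfigCompatSketch {
  private final boolean enabled;
  private final boolean buildOnDictionary;

  /** Canonical constructor carrying the new flag. */
  TextConfigCompatSketch(boolean enabled, boolean buildOnDictionary) {
    this.enabled = enabled;
    this.buildOnDictionary = buildOnDictionary;
  }

  /** Retained pre-flag constructor so previously compiled callers still link. */
  TextConfigCompatSketch(boolean enabled) {
    this(enabled, false);
  }

  boolean isBuildOnDictionary() {
    return buildOnDictionary;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof TextConfigCompatSketch)) {
      return false;
    }
    TextConfigCompatSketch that = (TextConfigCompatSketch) other;
    // The new flag participates in equality, so the two modes never compare equal.
    return enabled == that.enabled && buildOnDictionary == that.buildOnDictionary;
  }

  @Override
  public int hashCode() {
    return Objects.hash(enabled, buildOnDictionary);
  }
}
```

Because the old signature still exists in the compiled class, callers built against the earlier artifact resolve it at link time without recompilation.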
Addresses the codecov patch-coverage report on the PR by covering the previously-untested feature paths:

- store-in-segment-file (buffer-based reader) path in dictionary mode, which exercises the buffer LuceneTextIndexReader constructor and the buffer property overlay;
- a TEXT_MATCH that matches every distinct dictionary value, exercising the always-true short-circuit in the dict-id predicate evaluator;
- TextIndexConfig: buildOnDictionary participation in equals/hashCode and the retained pre-buildOnDictionary (binary-compat) constructor defaulting to false.

Test-only change.
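
As a rough illustration of the always-true short-circuit mentioned above: once the text query has been resolved to a set of matching dictIds, comparing the size of that set with the dictionary cardinality tells the evaluator whether any per-row work is needed. The class, enum, and method below are hypothetical, and the always-false branch is an assumption added for symmetry; the PR text only names the always-true case.

```java
import java.util.BitSet;

/** Hypothetical sketch of classifying a dict-id match set; not Pinot code. */
final class DictIdMatchClassifierSketch {
  enum Outcome { ALWAYS_FALSE, ALWAYS_TRUE, EVALUATE }

  static Outcome classify(BitSet matchingDictIds, int dictionaryCardinality) {
    int matches = matchingDictIds.cardinality();
    if (matches == 0) {
      return Outcome.ALWAYS_FALSE; // assumed: no distinct value matches, so no row can
    }
    if (matches == dictionaryCardinality) {
      return Outcome.ALWAYS_TRUE; // every distinct value matches, so every row does
    }
    return Outcome.EVALUATE; // resolve dictIds to docIds via inverted index or scan
  }
}
```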
Description
Adds an opt-in `buildOnDictionary` flag (default false) for the Lucene TEXT index. When enabled on a dictionary-encoded `STRING` column, the Lucene index is built over the column dictionary (one document per distinct value, with the Lucene docId equal to the dictId) instead of one document per row.

This makes the text index size scale with cardinality rather than row count, which is a large win for low/medium-cardinality text columns, while keeping the full Lucene query surface. `TEXT_MATCH` returns the matching dictIds, which are resolved to docIds through the existing dictionary-based filter operators (inverted index / scan), mirroring the FST / `REGEXP_LIKE` path.

Existing behavior is unchanged when the flag is off.
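
The two-step lookup described above can be modeled in a few lines. This is a toy sketch, not Pinot or Lucene code: the class name is hypothetical, a case-insensitive whitespace-token match stands in for a real Lucene query, and a plain forward-index scan stands in for the existing filter operators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of a dictionary-based text index; all names are illustrative. */
final class DictionaryTextIndexSketch {
  private final String[] dictionary; // sorted distinct values; array index == dictId
  private final int[] forwardIndex;  // row docId -> dictId

  DictionaryTextIndexSketch(String[] sortedDictionary, int[] forwardIndex) {
    this.dictionary = sortedDictionary;
    this.forwardIndex = forwardIndex;
  }

  /** Step 1: evaluate the text predicate once per distinct value, yielding dictIds. */
  List<Integer> matchingDictIds(String term) {
    String needle = term.toLowerCase(Locale.ROOT);
    List<Integer> dictIds = new ArrayList<>();
    for (int dictId = 0; dictId < dictionary.length; dictId++) {
      // Stand-in for a Lucene term query: whitespace tokens, case-insensitive.
      for (String token : dictionary[dictId].toLowerCase(Locale.ROOT).split("\\s+")) {
        if (token.equals(needle)) {
          dictIds.add(dictId);
          break;
        }
      }
    }
    return dictIds;
  }

  /** Step 2: resolve dictIds to row docIds by scanning the forward index. */
  List<Integer> matchingDocIds(String term) {
    List<Integer> dictIds = matchingDictIds(term);
    List<Integer> docIds = new ArrayList<>();
    for (int docId = 0; docId < forwardIndex.length; docId++) {
      if (dictIds.contains(forwardIndex[docId])) {
        docIds.add(docId);
      }
    }
    return docIds;
  }
}
```

In this model the text predicate runs over three dictionary entries regardless of how many rows reference them, which is the cardinality-versus-row-count trade the description refers to.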
Design (mirrors the FST dictionary pattern)
- Config: `FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY` + `TextIndexConfig.isBuildOnDictionary()` (default false), parsed in `TextIndexConfigBuilder`.
- Creation: `TextIndexType.createIndexCreator` feeds the sorted dictionary values to a dictionary-mode `LuceneTextIndexCreator` (the per-row `add()` becomes a no-op). Dictionary mode is gated to `commit == true`, so consuming (realtime) segments keep building per-row and the dictionary-based index is produced on seal/convert and on offline ingestion. Reload is handled by `TextIndexHandler` looping the dictionary, mirroring `FSTIndexHandler`.
- Self-describing segment: the build mode is persisted in `lucene.properties`, and the reader derives its mode from the segment, not the live table config. A table holding a mix of dictionary-based and per-row text segments (e.g. during rollout) reads each correctly. The table-config flag governs creation only.
- Reader: `LuceneTextIndexReader` serves `getDictIds()` in dictionary mode and uses a no-op docId translator (Lucene docId already equals dictId).
- Query: new `TextMatchDictIdPredicateEvaluatorFactory`; `FilterPlanNode` routes a dictionary-based `TEXT_MATCH` through `FilterOperatorUtils` to the standard scan/inverted operators. Single-value and multi-value STRING columns supported.
- Validation: `buildOnDictionary` on a non-dictionary-encoded column is rejected at table-config validation and again by a creation-time precondition.

Limitations / follow-ups
- `TEXT_MATCH` options are not supported in dictionary mode (the dictionary lookup takes only the query value); such queries are rejected rather than silently ignored.

Upgrade Notes
Opt-in and default-off, so no behavior change on upgrade. The build mode is recorded per segment, so a table can hold a mix of dictionary-based and per-row text segments during a rolling enablement and each is read correctly. Flipping the flag triggers a text-index rebuild on segment reload.
Testing
- `DictionaryBasedTextIndexQueriesTest` (new): `TEXT_MATCH` results equal the per-row index for SV and MV columns, including `OR` and zero-match queries; a non-dictionary column with `buildOnDictionary=true` fails the build; `TEXT_MATCH` with options is rejected for a dictionary-based index.
- `TextIndexConfigTest`: round-trips the new flag.

Regression (all green): per-row `TextSearchQueriesTest`, `FSTBasedRegexpLikeQueriesTest`, `IFSTBasedRegexpLikeQueriesTest`, and segment-local `LuceneTextIndexCreatorTest` / `TextIndexTest` / `LuceneTextIndexConfigReloadTest` / `TextIndexUtilsTest`. `spotless` / `license` / `checkstyle` clean.

Labels
`feature`, `text-index`