Add dictionary-based TEXT index option (buildOnDictionary) by xiangfu0 · Pull Request #18758 · apache/pinot

xiangfu0 · 2026-06-14T18:59:55Z

Description

Adds an opt-in buildOnDictionary flag (default false) for the Lucene TEXT index. When enabled on a dictionary-encoded STRING column, the Lucene index is built over the column dictionary — one document per distinct value, with the Lucene docId equal to the dictId — instead of one document per row.

This makes the text index size scale with cardinality rather than row count, which is a large win for low/medium-cardinality text columns, while keeping the full Lucene query surface. TEXT_MATCH returns the matching dictIds, which are resolved to docIds through the existing dictionary-based filter operators (inverted index / scan) — mirroring the FST / REGEXP_LIKE path.

Existing behavior is unchanged when the flag is off.
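
To make the dictId-to-docId resolution concrete, here is a minimal, self-contained sketch of the idea rather than the PR's code. It assumes a plain substring check as a stand-in for the Lucene query over the dictionary, and a simple forward-index scan as a stand-in for Pinot's scan-based filter operator. The class and method names (`DictTextMatchSketch`, `matchingDictIds`, `resolveDocIds`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class DictTextMatchSketch {
    // Stand-in for the text query over the dictionary (assumption: substring match,
    // not Lucene). One "document" per distinct value, so the hit position is the dictId.
    static SortedSet<Integer> matchingDictIds(String[] sortedDictionary, String term) {
        SortedSet<Integer> dictIds = new TreeSet<>();
        for (int dictId = 0; dictId < sortedDictionary.length; dictId++) {
            if (sortedDictionary[dictId].contains(term)) {
                dictIds.add(dictId);
            }
        }
        return dictIds;
    }

    // Stand-in for a scan-based filter: keep every row whose dictId matched.
    static List<Integer> resolveDocIds(int[] forwardIndex, Set<Integer> dictIds) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < forwardIndex.length; docId++) {
            if (dictIds.contains(forwardIndex[docId])) {
                docIds.add(docId);
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        String[] dictionary = {"disk full", "disk ok", "net down"}; // sorted distinct values
        int[] forwardIndex = {1, 0, 2, 1, 0, 1};                    // dictId stored per row
        SortedSet<Integer> dictIds = matchingDictIds(dictionary, "disk");
        // prints "[0, 1] -> [0, 1, 3, 4, 5]"
        System.out.println(dictIds + " -> " + resolveDocIds(forwardIndex, dictIds));
    }
}
```

In this toy example the text structure holds three entries for six rows, which is the cardinality-versus-row-count trade-off described above.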

Design (mirrors the FST dictionary pattern)

Config: FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY + TextIndexConfig.isBuildOnDictionary() (default false), parsed in TextIndexConfigBuilder.
Creation: TextIndexType.createIndexCreator feeds the sorted dictionary values to a dictionary-mode LuceneTextIndexCreator (the per-row add() becomes a no-op). Dictionary mode is gated to commit == true, so consuming (realtime) segments keep building per-row and the dictionary-based index is produced on seal/convert and on offline ingestion. Reload is handled by TextIndexHandler looping the dictionary, mirroring FSTIndexHandler.
Self-describing segments (mixed-table safe): the build mode is persisted into the segment's lucene.properties, and the reader derives its mode from the segment, not the live table config. A table holding a mix of dictionary-based and per-row text segments (e.g. during rollout) reads each correctly. The table-config flag governs creation only.
Reader: LuceneTextIndexReader serves getDictIds() in dictionary mode and uses a no-op docId translator (Lucene docId already equals dictId).
Query: new TextMatchDictIdPredicateEvaluatorFactory; FilterPlanNode routes a dictionary-based TEXT_MATCH through FilterOperatorUtils to the standard scan/inverted operators. Single-value and multi-value STRING columns supported.
Validation (hard failure): buildOnDictionary on a non-dictionary-encoded column is rejected at table-config validation and again by a creation-time precondition.

Limitations / follow-ups

TEXT_MATCH options are not supported in dictionary mode (the dictionary lookup takes only the query value) — such queries are rejected rather than silently ignored.
Realtime consuming segments always build the per-row index; the dictionary-based index is produced when the segment is sealed/converted.

Upgrade Notes

Opt-in and default-off, so no behavior change on upgrade. The build mode is recorded per segment, so a table can hold a mix of dictionary-based and per-row text segments during a rolling enablement and each is read correctly. Flipping the flag triggers a text-index rebuild on segment reload.

Testing

DictionaryBasedTextIndexQueriesTest (new):

dictionary-based TEXT_MATCH results equal the per-row index for SV and MV columns, including OR and zero-match queries;
reload builds the dictionary-based index from the dictionary;
the dictionary-based segment is smaller than per-row for a low-cardinality column;
a non-dictionary column with buildOnDictionary=true fails the build;
TEXT_MATCH with options is rejected for a dictionary-based index.

TextIndexConfigTest: round-trips the new flag.

Regression (all green): per-row TextSearchQueriesTest, FSTBasedRegexpLikeQueriesTest, IFSTBasedRegexpLikeQueriesTest, and segment-local LuceneTextIndexCreatorTest / TextIndexTest / LuceneTextIndexConfigReloadTest / TextIndexUtilsTest. spotless / license / checkstyle clean.

Labels

feature, text-index

…lse)

Adds an opt-in buildOnDictionary flag for the Lucene TEXT index. When enabled on a dictionary-encoded STRING column, the Lucene index is built over the column dictionary (one document per distinct value, Lucene docId == dictId) instead of one document per row. This makes the index size scale with cardinality rather than row count, and TEXT_MATCH resolves the matching dictIds to docIds through the existing dictionary-based filter operators (inverted index / scan), mirroring the FST/REGEXP_LIKE path.

Details:
- Config: FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY + TextIndexConfig.isBuildOnDictionary() (default false), parsed in TextIndexConfigBuilder.
- Creation: TextIndexType.createIndexCreator feeds the sorted dictionary values to a dict-mode LuceneTextIndexCreator (per-row add() is a no-op; dict mode is gated to commit==true so consuming segments keep building per-row). Reload is handled by TextIndexHandler looping the dictionary, mirroring FSTIndexHandler.
- Self-describing segment: the build mode is persisted in lucene.properties; the reader derives its mode from the segment, not the live table config, so a table holding a mix of dictionary-based and per-row text segments reads each correctly. The table-config flag governs creation only.
- Reader: LuceneTextIndexReader serves getDictIds() in dictionary mode and uses a no-op docId translator (Lucene docId already equals dictId).
- Query: new TextMatchDictIdPredicateEvaluatorFactory; FilterPlanNode routes a dictionary-based TEXT_MATCH through FilterOperatorUtils. SV and MV supported. TEXT_MATCH options are not supported in dictionary mode (rejected).
- Validation: buildOnDictionary requires a dictionary-encoded column (hard failure at table-config validation and a creation-time precondition).

Tests: dictionary-based TEXT_MATCH equals the per-row index for SV/MV, OR and zero-match queries; reload builds the index from the dictionary; size is smaller for low cardinality; non-dictionary column and TEXT_MATCH options are rejected; TextIndexConfig round-trips the flag.

codecov-commenter · 2026-06-14T19:52:27Z

Codecov Report

❌ Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.80%. Comparing base (b0cb24b) to head (8149c49).
⚠️ Report is 21 commits behind head on master.

Files with missing lines	Patch %	Lines
...pache/pinot/segment/spi/index/TextIndexConfig.java	64.70%	6 Missing ⚠️
...egment/local/segment/index/text/TextIndexType.java	80.00%	2 Missing and 1 partial ⚠️
...ment/creator/impl/text/LuceneTextIndexCreator.java	80.00%	0 Missing and 2 partials ⚠️
...ment/index/readers/text/LuceneTextIndexReader.java	88.23%	1 Missing and 1 partial ⚠️
...cate/TextMatchDictIdPredicateEvaluatorFactory.java	94.11%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18758      +/-   ##
============================================
+ Coverage     64.59%   64.80%   +0.21%     
- Complexity     1305     1309       +4     
============================================
  Files          3373     3381       +8     
  Lines        208894   209633     +739     
  Branches      32667    32816     +149     
============================================
+ Hits         134937   135857     +920     
+ Misses        63116    62838     -278     
- Partials      10841    10938      +97

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-21	`64.80% <85.10%> (+0.21%)`	⬆️
temurin	`64.80% <85.10%> (+0.21%)`	⬆️
unittests	`64.80% <85.10%> (+0.21%)`	⬆️
unittests1	`57.01% <84.04%> (-0.02%)`	⬇️
unittests2	`37.25% <15.95%> (+0.11%)`	⬆️


Copilot

Pull request overview

Adds an opt-in buildOnDictionary mode for Lucene TEXT indexes. For dictionary-encoded STRING columns, the Lucene index can be built per distinct dictionary value (Lucene docId == dictId) rather than per row, which reduces index size for lower-cardinality text columns. TEXT_MATCH semantics are preserved by resolving the matching dictIds to docIds through the existing dictionary-based filter operators.

Changes:

Introduces buildOnDictionary config plumbed through FieldConfig/TextIndexConfig, parsing, equality/hash, and persistence to lucene.properties.
Implements dictionary-mode creation/handling (segment creation + reload) and reader support (getDictIds(), no-op docId translator).
Routes dictionary-based TEXT_MATCH through a new dict-id predicate evaluator + adds an end-to-end query test suite.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

File	Description
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java	Adds `TEXT_INDEX_BUILD_ON_DICTIONARY` field property constant.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java	Adds `buildOnDictionary` to TEXT index config + builder + equality/hash.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/TextIndexReader.java	Adds `isBuildOnDictionary()` default contract for readers.
pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/TextIndexConfigTest.java	Adds JSON parsing/round-trip test for `buildOnDictionary`.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java	Persists/recovers `buildOnDictionary` in Lucene properties.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexType.java	Validates/enforces dictionary mode and seeds creator from unique dictionary values.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexConfigBuilder.java	Parses `buildOnDictionary` from field properties.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java	Implements dictionary-mode reading (`isBuildOnDictionary`, `getDictIds`, no-op translator).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/TextIndexHandler.java	Builds dictionary-mode index on reload by iterating dictionary values.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java	Makes per-row add() callbacks no-ops in dictionary mode; disables reuse-mutable-index for dict mode.
pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java	Routes dictionary-mode `TEXT_MATCH` via dict-id predicate evaluator + dictionary-based filter operators.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/TextMatchDictIdPredicateEvaluatorFactory.java	New predicate evaluator factory returning matching dictIds bitmap.
pinot-core/src/test/java/org/apache/pinot/queries/DictionaryBasedTextIndexQueriesTest.java	New end-to-end tests comparing per-row vs dictionary-mode `TEXT_MATCH` and reload behavior.

… absent)

When a text-index segment's persisted lucene.properties does not contain the buildOnDictionary key (segments written before this feature), the reader now defaults to false instead of falling back to the live table-config value. This preserves the segment-self-describes guarantee: a per-row segment can never be misread as dictionary-based after the flag is flipped on at the table level.

- TextIndexUtils.getUpdatedConfigFromPropertiesFile: default false when absent.
- LuceneTextIndexReader.updateConfigFromProperties (buffer path): default false when absent.
- Test: segmentWithoutPersistedFlagReadsAsPerRow strips the key from a per-row segment's properties, reloads under a buildOnDictionary=true table config, and asserts the segment is still read as per-row with correct TEXT_MATCH results.
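
The default-when-absent rule can be illustrated with a small sketch built on `java.util.Properties`. This is not the code in TextIndexUtils or LuceneTextIndexReader; the class name `SegmentBuildModeSketch` and the method signature are hypothetical, and the exact way Pinot stores the key in lucene.properties is assumed from the commit message.

```java
import java.util.Properties;

final class SegmentBuildModeSketch {
    // Key name as written in the commit message; the real storage details are an assumption.
    static final String BUILD_ON_DICTIONARY_KEY = "buildOnDictionary";

    // The segment's own properties decide the mode. The live table-config flag is
    // accepted only to show that it is deliberately ignored on the read path.
    static boolean isBuildOnDictionary(Properties segmentProperties, boolean tableConfigFlag) {
        // Missing key (segment written before the feature) means per-row, i.e. false.
        return Boolean.parseBoolean(segmentProperties.getProperty(BUILD_ON_DICTIONARY_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties oldSegment = new Properties(); // no buildOnDictionary key
        Properties newSegment = new Properties();
        newSegment.setProperty(BUILD_ON_DICTIONARY_KEY, "true");

        System.out.println(isBuildOnDictionary(oldSegment, true));  // prints "false"
        System.out.println(isBuildOnDictionary(newSegment, false)); // prints "true"
    }
}
```

Ignoring the table-config value on read is what lets dictionary-based and per-row segments coexist in one table.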

The Pinot Binary Compatibility Check (japicmp) flagged the buildOnDictionary parameter added to the public TextIndexConfig @JsonCreator constructor as a removed constructor in the pinot-segment-spi SPI artifact. Restore the prior 19-arg public constructor (delegating to the canonical constructor with buildOnDictionary defaulted to false) so old compiled callers keep linking. Verified locally with japicmp 0.23.1 --error-on-binary-incompatibility: both pinot-spi and pinot-segment-spi report no binary-incompatible changes.
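
The binary-compatibility fix follows a common pattern: keep the old constructor signature and have it delegate to the new canonical one. The sketch below shrinks the 19-argument constructor to a single hypothetical `disabled` argument for readability; `TextIndexConfigSketch` and its fields are illustrative and not the real TextIndexConfig.

```java
final class TextIndexConfigSketch {
    private final boolean _disabled;
    private final boolean _buildOnDictionary;

    // Canonical constructor: carries the new buildOnDictionary parameter.
    TextIndexConfigSketch(boolean disabled, boolean buildOnDictionary) {
        _disabled = disabled;
        _buildOnDictionary = buildOnDictionary;
    }

    // Retained pre-feature signature, so callers compiled against the old
    // constructor still link. The new flag defaults to false.
    TextIndexConfigSketch(boolean disabled) {
        this(disabled, false);
    }

    boolean isDisabled() {
        return _disabled;
    }

    boolean isBuildOnDictionary() {
        return _buildOnDictionary;
    }

    public static void main(String[] args) {
        System.out.println(new TextIndexConfigSketch(false).isBuildOnDictionary());       // prints "false"
        System.out.println(new TextIndexConfigSketch(false, true).isBuildOnDictionary()); // prints "true"
    }
}
```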

Addresses the codecov patch-coverage report on the PR by covering the previously-untested feature paths:
- store-in-segment-file (buffer-based reader) path in dictionary mode, which exercises the buffer LuceneTextIndexReader constructor and the buffer property overlay;
- a TEXT_MATCH that matches every distinct dictionary value, exercising the always-true short-circuit in the dict-id predicate evaluator;
- TextIndexConfig: buildOnDictionary participation in equals/hashCode and the retained pre-buildOnDictionary (binary-compat) constructor defaulting to false.

Test-only change.
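
The always-true short-circuit mentioned above can be sketched as follows. This is only an illustration of the idea: it uses `java.util.BitSet` in place of whatever bitmap type the evaluator really uses, the names `DictIdEvaluatorSketch`, `Outcome`, and `classify` are hypothetical, and the always-false branch for an empty match set is an assumption that the commit message does not confirm.

```java
import java.util.BitSet;

final class DictIdEvaluatorSketch {
    enum Outcome { ALWAYS_FALSE, ALWAYS_TRUE, FILTER_BY_DICT_IDS }

    // Decide how to evaluate the predicate from the set of matching dictIds.
    static Outcome classify(BitSet matchingDictIds, int dictionaryCardinality) {
        int matches = matchingDictIds.cardinality();
        if (matches == 0) {
            return Outcome.ALWAYS_FALSE; // assumption: no distinct value matched
        }
        if (matches == dictionaryCardinality) {
            return Outcome.ALWAYS_TRUE;  // every distinct value matched, so every row matches
        }
        return Outcome.FILTER_BY_DICT_IDS;
    }

    public static void main(String[] args) {
        BitSet all = new BitSet();
        all.set(0, 3); // dictIds 0, 1, 2
        System.out.println(classify(all, 3)); // prints "ALWAYS_TRUE"
    }
}
```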

xiangfu0 added the feature (New functionality), text-search (Related to text/Lucene indexing and search), and index (Related to indexing (general)) labels Jun 15, 2026

xiangfu0 requested review from Jackie-Jiang, Copilot and raghavyadav01 June 15, 2026 08:01

Copilot started reviewing on behalf of xiangfu0 June 15, 2026 08:02

Copilot AI reviewed Jun 15, 2026

Comment thread ...segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java Outdated

Comment thread ...in/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java Outdated

xiangfu0 added 3 commits June 15, 2026 01:27

xiangfu0 wants to merge 4 commits into apache:master from xiangfu0:claude/dict-based-text-index
