Add dictionary-based TEXT index option (buildOnDictionary) #18758

Open
xiangfu0 wants to merge 4 commits into apache:master from xiangfu0:claude/dict-based-text-index

Conversation

@xiangfu0

Contributor

Description

Adds an opt-in buildOnDictionary flag (default false) for the Lucene TEXT index. When enabled on a dictionary-encoded STRING column, the Lucene index is built over the column dictionary — one document per distinct value, with the Lucene docId equal to the dictId — instead of one document per row.

This makes the text index size scale with cardinality rather than row count, which is a large win for low/medium-cardinality text columns, while keeping the full Lucene query surface. TEXT_MATCH returns the matching dictIds, which are resolved to docIds through the existing dictionary-based filter operators (inverted index / scan) — mirroring the FST / REGEXP_LIKE path.

Existing behavior is unchanged when the flag is off.
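
The two-step resolution described above can be illustrated with a toy model. This is a minimal sketch using plain Java collections, not Pinot or Lucene code: the class name, the sample data, and the whitespace tokenization are all assumptions made for illustration. It only shows why matching once per distinct value and then scanning dictIds yields the same docIds as a per-row index would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of a dictionary-based text match; hypothetical, not Pinot or Lucene code. */
final class DictionaryTextMatchSketch {
  // Sorted distinct values; the array position plays the role of the dictId.
  static final String[] DICTIONARY = {"disk full", "disk ok", "network timeout"};
  // Forward index: the dictId stored for each row (the row number is the docId).
  static final int[] ROW_DICT_IDS = {1, 0, 1, 2, 1, 0};

  /** Step 1: evaluate the text predicate once per distinct value. */
  static List<Integer> matchingDictIds(String term) {
    String needle = term.toLowerCase(Locale.ROOT);
    List<Integer> dictIds = new ArrayList<>();
    for (int dictId = 0; dictId < DICTIONARY.length; dictId++) {
      // Whitespace tokenization stands in for a real analyzer (assumption).
      String[] tokens = DICTIONARY[dictId].toLowerCase(Locale.ROOT).split("\\s+");
      if (Arrays.asList(tokens).contains(needle)) {
        dictIds.add(dictId);
      }
    }
    return dictIds;
  }

  /** Step 2: resolve the matching dictIds to docIds by scanning the forward index. */
  static List<Integer> resolveDocIds(List<Integer> dictIds) {
    List<Integer> docIds = new ArrayList<>();
    for (int docId = 0; docId < ROW_DICT_IDS.length; docId++) {
      if (dictIds.contains(ROW_DICT_IDS[docId])) {
        docIds.add(docId);
      }
    }
    return docIds;
  }

  public static void main(String[] args) {
    List<Integer> dictIds = matchingDictIds("disk");
    System.out.println(dictIds + " -> " + resolveDocIds(dictIds));
  }
}
```

In this toy data the term "disk" matches dictIds 0 and 1, which resolve to docIds 0, 1, 2, 4 and 5. The text predicate runs three times (once per distinct value) instead of six times (once per row), which is the source of the size and work reduction for low-cardinality columns.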

Design (mirrors the FST dictionary pattern)

  • Config: FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY + TextIndexConfig.isBuildOnDictionary() (default false), parsed in TextIndexConfigBuilder.
  • Creation: TextIndexType.createIndexCreator feeds the sorted dictionary values to a dictionary-mode LuceneTextIndexCreator (the per-row add() becomes a no-op). Dictionary mode is gated to commit == true, so consuming (realtime) segments keep building per-row and the dictionary-based index is produced on seal/convert and on offline ingestion. Reload is handled by TextIndexHandler looping the dictionary, mirroring FSTIndexHandler.
  • Self-describing segments (mixed-table safe): the build mode is persisted into the segment's lucene.properties, and the reader derives its mode from the segment, not the live table config. A table holding a mix of dictionary-based and per-row text segments (e.g. during rollout) reads each correctly. The table-config flag governs creation only.
  • Reader: LuceneTextIndexReader serves getDictIds() in dictionary mode and uses a no-op docId translator (Lucene docId already equals dictId).
  • Query: new TextMatchDictIdPredicateEvaluatorFactory; FilterPlanNode routes a dictionary-based TEXT_MATCH through FilterOperatorUtils to the standard scan/inverted operators. Single-value and multi-value STRING columns supported.
  • Validation (hard failure): buildOnDictionary on a non-dictionary-encoded column is rejected at table-config validation and again by a creation-time precondition.
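
The creation-time rules in the list above (flag off means per-row, a missing dictionary is a hard failure, and dictionary mode only applies when commit is true) can be summarized as a small decision function. This is a hedged sketch: the class, enum, and method names are hypothetical and do not correspond to the actual code in TextIndexType or LuceneTextIndexCreator.

```java
/** Sketch of the build-mode decision described above; all names are hypothetical. */
final class TextIndexBuildModeSketch {
  enum BuildMode { PER_ROW, DICTIONARY }

  static BuildMode chooseBuildMode(boolean buildOnDictionary, boolean hasDictionary, boolean commit) {
    if (!buildOnDictionary) {
      // Flag off: existing per-row behavior is unchanged.
      return BuildMode.PER_ROW;
    }
    if (!hasDictionary) {
      // Hard failure rather than a silent fallback to per-row.
      throw new IllegalStateException("buildOnDictionary requires a dictionary-encoded column");
    }
    // Consuming (realtime) segments keep building per-row; sealed and offline segments use the dictionary.
    return commit ? BuildMode.DICTIONARY : BuildMode.PER_ROW;
  }

  public static void main(String[] args) {
    System.out.println(chooseBuildMode(true, true, true));
    System.out.println(chooseBuildMode(true, true, false));
  }
}
```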

Limitations / follow-ups

  • TEXT_MATCH options are not supported in dictionary mode (the dictionary lookup takes only the query value) — such queries are rejected rather than silently ignored.
  • Realtime consuming segments always build the per-row index; the dictionary-based index is produced when the segment is sealed/converted.

Upgrade Notes

Opt-in and default-off, so no behavior change on upgrade. The build mode is recorded per segment, so a table can hold a mix of dictionary-based and per-row text segments during a rolling enablement and each is read correctly. Flipping the flag triggers a text-index rebuild on segment reload.

Testing

DictionaryBasedTextIndexQueriesTest (new):

  • dictionary-based TEXT_MATCH results equal the per-row index for SV and MV columns, including OR and zero-match queries;
  • reload builds the dictionary-based index from the dictionary;
  • the dictionary-based segment is smaller than per-row for a low-cardinality column;
  • a non-dictionary column with buildOnDictionary=true fails the build;
  • TEXT_MATCH with options is rejected for a dictionary-based index.

TextIndexConfigTest: round-trips the new flag.

Regression (all green): per-row TextSearchQueriesTest, FSTBasedRegexpLikeQueriesTest, IFSTBasedRegexpLikeQueriesTest, and segment-local LuceneTextIndexCreatorTest / TextIndexTest / LuceneTextIndexConfigReloadTest / TextIndexUtilsTest. spotless / license / checkstyle clean.

Labels

feature, text-index

…lse)

Adds an opt-in buildOnDictionary flag for the Lucene TEXT index. When enabled
on a dictionary-encoded STRING column, the Lucene index is built over the
column dictionary (one document per distinct value, Lucene docId == dictId)
instead of one document per row. This makes the index size scale with
cardinality rather than row count, and TEXT_MATCH resolves the matching dictIds
to docIds through the existing dictionary-based filter operators (inverted
index / scan), mirroring the FST/REGEXP_LIKE path.

Details:
- Config: FieldConfig.TEXT_INDEX_BUILD_ON_DICTIONARY + TextIndexConfig
  .isBuildOnDictionary() (default false), parsed in TextIndexConfigBuilder.
- Creation: TextIndexType.createIndexCreator feeds the sorted dictionary values
  to a dict-mode LuceneTextIndexCreator (per-row add() is a no-op; dict mode is
  gated to commit==true so consuming segments keep building per-row). Reload is
  handled by TextIndexHandler looping the dictionary, mirroring FSTIndexHandler.
- Self-describing segment: the build mode is persisted in lucene.properties; the
  reader derives its mode from the segment, not the live table config, so a
  table holding a mix of dictionary-based and per-row text segments reads each
  correctly. The table-config flag governs creation only.
- Reader: LuceneTextIndexReader serves getDictIds() in dictionary mode and uses
  a no-op docId translator (Lucene docId already equals dictId).
- Query: new TextMatchDictIdPredicateEvaluatorFactory; FilterPlanNode routes a
  dictionary-based TEXT_MATCH through FilterOperatorUtils. SV and MV supported.
  TEXT_MATCH options are not supported in dictionary mode (rejected).
- Validation: buildOnDictionary requires a dictionary-encoded column (hard
  failure at table-config validation and a creation-time precondition).

Tests: dictionary-based TEXT_MATCH equals the per-row index for SV/MV, OR and
zero-match queries; reload builds the index from the dictionary; size is smaller
for low cardinality; non-dictionary column and TEXT_MATCH options are rejected;
TextIndexConfig round-trips the flag.
@codecov-commenter

codecov-commenter commented Jun 14, 2026

Codecov Report

❌ Patch coverage is 85.10638% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.80%. Comparing base (b0cb24b) to head (8149c49).
⚠️ Report is 21 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pache/pinot/segment/spi/index/TextIndexConfig.java | 64.70% | 6 Missing ⚠️ |
| ...egment/local/segment/index/text/TextIndexType.java | 80.00% | 2 Missing and 1 partial ⚠️ |
| ...ment/creator/impl/text/LuceneTextIndexCreator.java | 80.00% | 0 Missing and 2 partials ⚠️ |
| ...ment/index/readers/text/LuceneTextIndexReader.java | 88.23% | 1 Missing and 1 partial ⚠️ |
| ...cate/TextMatchDictIdPredicateEvaluatorFactory.java | 94.11% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18758      +/-   ##
============================================
+ Coverage     64.59%   64.80%   +0.21%     
- Complexity     1305     1309       +4     
============================================
  Files          3373     3381       +8     
  Lines        208894   209633     +739     
  Branches      32667    32816     +149     
============================================
+ Hits         134937   135857     +920     
+ Misses        63116    62838     -278     
- Partials      10841    10938      +97     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.80% <85.10%> (+0.21%) ⬆️ |
| temurin | 64.80% <85.10%> (+0.21%) ⬆️ |
| unittests | 64.80% <85.10%> (+0.21%) ⬆️ |
| unittests1 | 57.01% <84.04%> (-0.02%) ⬇️ |
| unittests2 | 37.25% <15.95%> (+0.11%) ⬆️ |


@xiangfu0 added the labels feature (New functionality), text-search (Related to text/Lucene indexing and search), and index (Related to indexing (general)) on Jun 15, 2026

Copilot AI left a comment

Pull request overview

Adds an opt-in buildOnDictionary mode for Lucene TEXT indexes. For dictionary-encoded STRING columns, the Lucene index can then be built per distinct dictionary value (docId == dictId) rather than per row. This reduces index size for lower-cardinality text columns while preserving TEXT_MATCH semantics: matching dictIds are resolved to docIds through the existing dictionary-based filter operators.

Changes:

  • Introduces buildOnDictionary config plumbed through FieldConfig/TextIndexConfig, parsing, equality/hash, and persistence to lucene.properties.
  • Implements dictionary-mode creation/handling (segment creation + reload) and reader support (getDictIds(), no-op docId translator).
  • Routes dictionary-based TEXT_MATCH through a new dict-id predicate evaluator + adds an end-to-end query test suite.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java | Adds TEXT_INDEX_BUILD_ON_DICTIONARY field property constant. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/TextIndexConfig.java | Adds buildOnDictionary to TEXT index config + builder + equality/hash. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/TextIndexReader.java | Adds isBuildOnDictionary() default contract for readers. |
| pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/TextIndexConfigTest.java | Adds JSON parsing/round-trip test for buildOnDictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/store/TextIndexUtils.java | Persists/recovers buildOnDictionary in Lucene properties. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexType.java | Validates/enforces dictionary mode and seeds creator from unique dictionary values. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/text/TextIndexConfigBuilder.java | Parses buildOnDictionary from field properties. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/text/LuceneTextIndexReader.java | Implements dictionary-mode reading (isBuildOnDictionary, getDictIds, no-op translator). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/TextIndexHandler.java | Builds dictionary-mode index on reload by iterating dictionary values. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/text/LuceneTextIndexCreator.java | Makes per-row add() callbacks no-ops in dictionary mode; disables reuse-mutable-index for dict mode. |
| pinot-core/src/main/java/org/apache/pinot/core/plan/FilterPlanNode.java | Routes dictionary-mode TEXT_MATCH via dict-id predicate evaluator + dictionary-based filter operators. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/TextMatchDictIdPredicateEvaluatorFactory.java | New predicate evaluator factory returning matching dictIds bitmap. |
| pinot-core/src/test/java/org/apache/pinot/queries/DictionaryBasedTextIndexQueriesTest.java | New end-to-end tests comparing per-row vs dictionary-mode TEXT_MATCH and reload behavior. |

xiangfu0 added 3 commits June 15, 2026 01:27
… absent)

When a text-index segment's persisted lucene.properties does not contain the
buildOnDictionary key (segments written before this feature), the reader now
defaults to false instead of falling back to the live table-config value. This
preserves the segment-self-describes guarantee: a per-row segment can never be
misread as dictionary-based after the flag is flipped on at the table level.

- TextIndexUtils.getUpdatedConfigFromPropertiesFile: default false when absent.
- LuceneTextIndexReader.updateConfigFromProperties (buffer path): default false
  when absent.
- Test: segmentWithoutPersistedFlagReadsAsPerRow strips the key from a per-row
  segment's properties, reloads under a buildOnDictionary=true table config, and
  asserts the segment is still read as per-row with correct TEXT_MATCH results.
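
The absent-key default described in this commit can be sketched with java.util.Properties. This is an illustration under stated assumptions: the class and method names are hypothetical, and the property key string used here is assumed from the flag name, so the actual key written to lucene.properties may differ.

```java
import java.util.Properties;

/** Sketch of a reader deriving its mode from segment properties only; names are hypothetical. */
final class SegmentBuildModeSketch {
  // Assumed key name; the real key persisted in lucene.properties may differ.
  static final String BUILD_ON_DICTIONARY_KEY = "buildOnDictionary";

  /**
   * Returns the build mode recorded in the segment. The live table config is deliberately
   * not a parameter: a segment written before the feature existed has no key and reads as per-row.
   */
  static boolean isBuildOnDictionary(Properties segmentProperties) {
    return Boolean.parseBoolean(segmentProperties.getProperty(BUILD_ON_DICTIONARY_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties legacySegment = new Properties();
    Properties dictionarySegment = new Properties();
    dictionarySegment.setProperty(BUILD_ON_DICTIONARY_KEY, "true");
    System.out.println(isBuildOnDictionary(legacySegment));
    System.out.println(isBuildOnDictionary(dictionarySegment));
  }
}
```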

The Pinot Binary Compatibility Check (japicmp) flagged the buildOnDictionary
parameter added to the public TextIndexConfig @JsonCreator constructor as a
removed constructor in the pinot-segment-spi SPI artifact. Restore the prior
19-arg public constructor (delegating to the canonical constructor with
buildOnDictionary defaulted to false) so old compiled callers keep linking.
Verified locally with japicmp 0.23.1 --error-on-binary-incompatibility: both
pinot-spi and pinot-segment-spi report no binary-incompatible changes.
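
The binary-compatibility fix follows a common Java pattern: keep the old constructor signature and delegate to the new canonical one with the new parameter defaulted. The sketch below shows the pattern on a hypothetical two-field class; the single analyzerName argument is a stand-in for the existing arguments of the real TextIndexConfig constructor, which is not reproduced here.

```java
/** Sketch of the retained-constructor pattern; a hypothetical class, not TextIndexConfig. */
final class TextIndexSettingsSketch {
  private final String analyzerName; // stand-in for the pre-existing constructor arguments
  private final boolean buildOnDictionary;

  /** Canonical constructor, including the new flag. */
  TextIndexSettingsSketch(String analyzerName, boolean buildOnDictionary) {
    this.analyzerName = analyzerName;
    this.buildOnDictionary = buildOnDictionary;
  }

  /** Retained pre-flag signature so previously compiled callers keep linking. */
  TextIndexSettingsSketch(String analyzerName) {
    this(analyzerName, false);
  }

  String getAnalyzerName() {
    return analyzerName;
  }

  boolean isBuildOnDictionary() {
    return buildOnDictionary;
  }

  public static void main(String[] args) {
    System.out.println(new TextIndexSettingsSketch("standard").isBuildOnDictionary());
  }
}
```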

Addresses the codecov patch-coverage report on the PR by covering the
previously-untested feature paths:
- store-in-segment-file (buffer-based reader) path in dictionary mode, which
  exercises the buffer LuceneTextIndexReader constructor and the buffer property
  overlay;
- a TEXT_MATCH that matches every distinct dictionary value, exercising the
  always-true short-circuit in the dict-id predicate evaluator;
- TextIndexConfig: buildOnDictionary participation in equals/hashCode and the
  retained pre-buildOnDictionary (binary-compat) constructor defaulting to false.

Test-only change.
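
The always-true short-circuit mentioned above can be pictured as a classification of the matching dictId set against the dictionary cardinality. This is a rough sketch with hypothetical names, using java.util.BitSet in place of whatever bitmap type the real evaluator uses. The always-false branch is an assumption added for symmetry; the source only mentions the always-true case.

```java
import java.util.BitSet;

/** Sketch of classifying a dictId match set; names are hypothetical. */
final class DictIdMatchClassifierSketch {
  enum MatchKind { ALWAYS_FALSE, ALWAYS_TRUE, SELECTIVE }

  static MatchKind classify(BitSet matchingDictIds, int dictionaryCardinality) {
    int matches = matchingDictIds.cardinality();
    if (matches == 0) {
      // Assumed branch: nothing matches, so no row can match.
      return MatchKind.ALWAYS_FALSE;
    }
    if (matches == dictionaryCardinality) {
      // Every distinct value matches, so every row matches and no lookup is needed.
      return MatchKind.ALWAYS_TRUE;
    }
    return MatchKind.SELECTIVE;
  }

  public static void main(String[] args) {
    BitSet all = new BitSet();
    all.set(0, 3); // dictIds 0, 1 and 2
    System.out.println(classify(all, 3));
  }
}
```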
Labels

feature (New functionality), index (Related to indexing (general)), text-search (Related to text/Lucene indexing and search)

3 participants