Skip to content

Optimize FieldExistsQuery.count() when all docs have the field#16111

Merged
msfroh merged 5 commits into
apache:mainfrom
iprithv:optimize-fieldexists-count
Jun 16, 2026
Merged

Optimize FieldExistsQuery.count() when all docs have the field#16111
msfroh merged 5 commits into
apache:mainfrom
iprithv:optimize-fieldexists-count

Conversation

@iprithv

@iprithv iprithv commented May 23, 2026

Copy link
Copy Markdown
Contributor

Optimize FieldExistsQuery.count() when all docs have the field.
Vectors return numDocs() directly.
Doc values only optimize with deletions when DocValuesSkipper is available.
points/terms are not used as proxies since they may not match doc values coverage exactly.

@iprithv iprithv force-pushed the optimize-fieldexists-count branch 2 times, most recently from 5ff0197 to 10b7760 Compare May 23, 2026 01:59
@github-actions github-actions Bot added this to the 11.0.0 milestone May 23, 2026
@iprithv iprithv force-pushed the optimize-fieldexists-count branch from 10b7760 to 0ac5a3b Compare May 28, 2026 23:43
@github-actions github-actions Bot modified the milestones: 11.0.0, 10.5.0 May 28, 2026
@iprithv iprithv force-pushed the optimize-fieldexists-count branch from 0ac5a3b to 2b4a435 Compare June 1, 2026 21:10
@iprithv iprithv force-pushed the optimize-fieldexists-count branch from 2b4a435 to dc362c8 Compare June 2, 2026 07:42
Comment on lines +266 to +268
if (pointValues != null) {
return pointValues.getDocCount();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new logic falls through to super.count(context) == -1, which forces IndexSearcher to do a full scan to arrive at the same 0. Not a correctness regression, but it does take a constant-time path to a linear one for these edge cases. Probably fine since the case is rare in practice?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... losing the zero case for this and the Terms check is not ideal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, actually. updated, thanks...when null will not do full scan now

@msfroh msfroh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that you did the optimization just for the DocValuesSkipper case, but I was thinking you might be able to do something like:

// Don't do the `hasDeletions` check yet.
int count = -1;
if (fieldInfo.getPointDimensionCount() > 0) {
  PointValues pointValues = reader.getPointValues(field);
  count = (pointValues == null ? 0 : pointValues.getDocCount());
} else if (fieldInfo.getIndexOptions() != IndexOptions.NONE) {
  Terms terms = reader.terms(field);
  count =  (terms == null ? 0 : terms.getDocCount());
} else if (fieldInfo.docValuesSkipIndexType() != DocValuesSkipIndexType.NONE) {
  DocValuesSkipper docValuesSkipper = reader.getDocValuesSkipper(field);
  count = (docValuesSkipper == null ? 0 : docValuesSkipper.docCount());
}
if (count == 0) {
  // One of the above cases shows the field is not present on this leaf
  return 0;
} else if (count == reader.maxDoc()) {
  // All docs in the leaf (live or deleted) have the field. Return the count of live docs.
  return reader.numDocs();
} else if (count >=0 && reader.hasDeletions() == false) {
  // No deleted docs. The computed count can be trusted.
  return count;
}
// Some docs don't have the field and some docs are deleted. 
// Need to scan to get the correct intersection between field exists docs and live docs.
return super.count(context);

Honestly, I don't know if there's any reason why we couldn't extend this to cover the norms and vectors cases too. I think those three special cases (count == 0, count == maxDoc, no deletions) apply to any of the field types.

@iprithv -- what do you think?

Comment on lines +249 to +250
// DocValuesSkipper provides an exact doc count for doc values, so we can use it
// reliably even in the presence of deletions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit confusing.

I interpreted it as "The DocValuesSkipper knows how many live docs have the field", which confused me, since it doesn't have access to the live docs (since the DocValuesProducer's SegmentReadState has SegmentInfos, but not SegmentCommitInfos).

Actually reading the code, it's clear that we're saying "If every document in the segment, live or deleted, has the field, then the count is the number of live docs."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, updated the comment. thanks!

Comment on lines +266 to +268
if (pointValues != null) {
return pointValues.getDocCount();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... losing the zero case for this and the Terms check is not ideal.

@iprithv

iprithv commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

@iprithv -- what do you think?

makes sense, I kept norms as a special case though getDocCount() for norms actually returns postings count, not norms count (docs can have norms without postings when analyzer produces empty token stream). so the unified logic works for vectors/docValues/points/terms, but norms still only use the maxDoc shortcut. thanks!

@iprithv iprithv requested review from jainankitk and msfroh June 5, 2026 04:49
@msfroh msfroh merged commit 0be06f9 into apache:main Jun 16, 2026
13 checks passed
gaobinlong pushed a commit to gaobinlong/lucene that referenced this pull request Jun 17, 2026
vijaykriishna pushed a commit to vijaykriishna/lucene that referenced this pull request Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants