Optimize FieldExistsQuery.count() when all docs have the field#16111
Conversation
5ff0197 to
10b7760
Compare
10b7760 to
0ac5a3b
Compare
0ac5a3b to
2b4a435
Compare
2b4a435 to
dc362c8
Compare
| if (pointValues != null) { | ||
| return pointValues.getDocCount(); | ||
| } |
There was a problem hiding this comment.
new logic falls through to super.count(context) == -1, which forces IndexSearcher to do a full scan to arrive at the same 0. Not a correctness regression, but it does take a constant-time path to a linear one for these edge cases. Probably fine since the case is rare in practice?
There was a problem hiding this comment.
Yeah... losing the zero case for this and the Terms check is not ideal.
There was a problem hiding this comment.
yes, actually. updated, thanks...when null will not do full scan now
There was a problem hiding this comment.
I noticed that you did the optimization just for the DocValuesSkipper case, but I was thinking you might be able to do something like:
// Don't do the `hasDeletions` check yet.
int count = -1;
if (fieldInfo.getPointDimensionCount() > 0) {
PointValues pointValues = reader.getPointValues(field);
count = (pointValues == null ? 0 : pointValues.getDocCount());
} else if (fieldInfo.getIndexOptions() != IndexOptions.NONE) {
Terms terms = reader.terms(field);
count = (terms == null ? 0 : terms.getDocCount());
} else if (fieldInfo.docValuesSkipIndexType() != DocValuesSkipIndexType.NONE) {
DocValuesSkipper docValuesSkipper = reader.getDocValuesSkipper(field);
count = (docValuesSkipper == null ? 0 : docValuesSkipper.docCount());
}
if (count == 0) {
// One of the above cases shows the field is not present on this leaf
return 0;
} else if (count == reader.maxDoc()) {
// All docs in the leaf (live or deleted) have the field. Return the count of live docs.
return reader.numDocs();
} else if (count >=0 && reader.hasDeletions() == false) {
// No deleted docs. The computed count can be trusted.
return count;
}
// Some docs don't have the field and some docs are deleted.
// Need to scan to get the correct intersection between field exists docs and live docs.
return super.count(context);
Honestly, I don't know if there's any reason why we couldn't extend this to cover the norms and vectors cases too. I think those three special cases (count == 0, count == maxDoc, no deletions) apply to any of the field types.
@iprithv -- what do you think?
| // DocValuesSkipper provides an exact doc count for doc values, so we can use it | ||
| // reliably even in the presence of deletions. |
There was a problem hiding this comment.
This comment is a bit confusing.
I interpreted it as "The DocValuesSkipper knows how many live docs have the field", which confused me, since it doesn't have access to the live docs (since the DocValuesProducer's SegmentReadState has SegmentInfos, but not SegmentCommitInfos).
Actually reading the code, it's clear that we're saying "If every document in the segment, live or deleted, has the field, then the count is the number of live docs."
There was a problem hiding this comment.
ah yeah, updated the comment. thanks!
| if (pointValues != null) { | ||
| return pointValues.getDocCount(); | ||
| } |
There was a problem hiding this comment.
Yeah... losing the zero case for this and the Terms check is not ideal.
makes sense, I kept norms as a special case though getDocCount() for norms actually returns postings count, not norms count (docs can have norms without postings when analyzer produces empty token stream). so the unified logic works for vectors/docValues/points/terms, but norms still only use the maxDoc shortcut. thanks! |
I mistakenly left a whitespace when resolving conflicts.
Optimize FieldExistsQuery.count() when all docs have the field.
Vectors return numDocs() directly.
Doc values only optimize with deletions when DocValuesSkipper is available.
points/terms are not used as proxies since they may not match doc values coverage exactly.