[GH-3089] Fix empty-envelope false candidate and RS_DWithin EXPLAIN label in raster distance join#3090
Draft
jiayuasu wants to merge 1 commit into
Draft
Conversation
…LAIN label in raster distance join Two issues in the raster distance-join machinery built around RS_DWithin (apacheGH-3089), both confined to spark/common: 1. Empty-envelope false candidate. When a raster or geometry input is NULL, the join substitutes an empty GeometryCollection. expandRasterFilterEnvelope then expanded that geometry's degenerate envelope by `distance`, producing a non-empty R-tree filter envelope that pulls the row in as a coarse-filter candidate (and can yield a NaN-bounded envelope). Return the empty shape unchanged so the filter excludes it. 2. Misleading EXPLAIN output. The raster distance branch of BroadcastIndexJoinExec.simpleString printed "RS_Distance(left, right) < r" — a function that does not exist in Sedona. Print "RS_DWithin(left, right, r)" to match the actual join predicate.
5e0ee4c to
be8ad38
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses two correctness/diagnostic issues in Sedona’s raster distance-join implementation (RS_DWithin) within spark/common: preventing NULL inputs (represented as empty shapes) from creating spurious coarse-filter candidates, and correcting the physical plan EXPLAIN label to reflect the actual raster predicate.
Changes:
- Prevent expansion of empty raster/geometry envelopes in
expandRasterFilterEnvelope, avoiding false R-tree candidates for NULL inputs. - Update
BroadcastIndexJoinExec.simpleStringto displayRS_DWithin(left, right, r)instead of a non-existentRS_Distance(...) < rpredicate.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/TraitJoinQueryBase.scala | Adds an early return for empty shapes to keep NULL-substituted geometries from expanding into non-empty coarse filter envelopes. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/strategy/join/BroadcastIndexJoinExec.scala | Fixes raster distance-join EXPLAIN/simpleString rendering to show RS_DWithin(...) (the actual predicate). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+191
to
+196
| // An empty shape (e.g. the empty GeometryCollection substituted for a NULL raster/geometry) | ||
| // must stay empty: expanding its degenerate envelope by `distance` would yield a non-empty | ||
| // filter geometry that spuriously matches rows the predicate should never join. | ||
| if (baseShape.isEmpty) { | ||
| return baseShape | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Raster distance join: NULL inputs produce spurious R-tree candidates, plus misleading RS_Distance EXPLAIN label #3089What changes were proposed in this PR?
Two issues in the raster distance-join machinery built around
RS_DWithin, reported in #3089. Both changes are confined tospark/common.Empty-envelope false candidate. When a raster or geometry input is
NULL, the join substitutes an emptyGeometryCollection.expandRasterFilterEnvelopethen expanded that geometry's degenerate envelope bydistance, producing a non-empty R-tree filter envelope that pulls the row in as a coarse-filter candidate (and can produce aNaN-bounded envelope from the nullEnvelope). It now returns the empty shape unchanged so the coarse filter excludes it.Misleading EXPLAIN output. The raster distance branch of
BroadcastIndexJoinExec.simpleStringprintedRS_Distance(left, right) < r— a function that does not exist in Sedona. It now printsRS_DWithin(left, right, r), naming the actual predicate that drives the join.How was this patch tested?
sedona-spark-commoncompiles cleanly with the changes. Both are minimal changes within existing code paths exercised byRasterJoinSuiteandBroadcastIndexJoinSuite.Did this PR include necessary documentation updates?