Skip to content

fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse#4567

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-collation-concat-reverse
Open

fix: mark non-UTF8_BINARY collations as Incompatible for concat and reverse#4567
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-collation-concat-reverse

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Addresses the collation-related items (1, 2, and 3) of #4504. The remaining medium-priority items in that tracking issue (the array/size.sql MapType test and the CometSize defensive branch) are not collation-related and are left for a separate change, so this PR does not close #4504.

Rationale for this change

Spark 4.0 widened Concat.allowedTypes and the string branch of Reverse.inputTypes from StringType to StringTypeWithCollation, and both preserve the collation in their result type. Comet's native concat/reverse UDFs operate on raw bytes and produce Utf8 (UTF8_BINARY semantics), so for a non-default (non-UTF8_BINARY) collation the native result silently diverges from Spark. Per the audit-comet-expression conventions, such Spark 4.0+ collation gaps must surface as Incompatible(Some(reason)) so the divergence appears in EXPLAIN and the generated compatibility doc, and so the expression falls back to Spark by default.

What changes are included in this PR?

  • CometConcat (item 1): now reports Incompatible when any child or the result carries a non-default collation. The string-input check uses isInstanceOf[StringType] instead of == DataTypes.StringType, because a collated StringType is not equal to the default StringType and was otherwise misreported via the unsupported-input-type branch. getIncompatibleReasons lists the new reason.
  • CometReverse (item 2): the non-array (string) branch now reports Incompatible when the child carries a non-default collation.
  • CometReverse.getIncompatibleReasons (item 3): now appends the string-collation reason to the array-branch reason so every Incompatible(Some(r)) branch is represented (reaches EXPLAIN and the compatibility doc).

Both reasons are concise and link to the collation tracking issue #2190.

How are these changes tested?

Added expect_fallback cases to spark/src/test/resources/sql-tests/expressions/string/collation.sql (already gated MinSparkVersion: 4.0) asserting that concat/reverse on UTF8_LCASE strings fall back to Spark with the expected reason. Verified on Spark 4.1; confirmed concat/reverse remain Compatible and the existing fixtures still pass on both Spark 3.5 and 4.1, and that the collation file is correctly skipped on 3.5.

…everse

Spark 4.0 widens `Concat` and the string branch of `Reverse` to accept
collated strings and preserves the collation in the result type. Comet's
native `concat`/`reverse` UDFs produce UTF8 (UTF8_BINARY semantics), so a
non-default collation silently diverges from Spark.

- CometConcat: detect string inputs with isInstanceOf[StringType] (so a
  collated StringType, which is not == the default StringType, is still
  treated as string input) and report Incompatible for non-UTF8_BINARY
  collations instead of Compatible / a misleading unsupported-type reason.
- CometReverse: report Incompatible for the non-array string branch when
  the child carries a non-default collation.
- CometReverse.getIncompatibleReasons now also lists the string-collation
  reason alongside the array-branch reason, so it reaches EXPLAIN and the
  generated compatibility doc.

Addresses the collation items (1-3) of apache#4504.
Copy link
Copy Markdown
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
Maybe add another test case for a standard collation sequence like UNICODE_CI or UNICODE_CI_AI just to verify the fallback.

@andygrove
Copy link
Copy Markdown
Member Author

lgtm Maybe add another test case for a standard collation sequence like UNICODE_CI or UNICODE_CI_AI just to verify the fallback.

Added. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

collection expression audit follow-ups (from #4473)

2 participants