feat: support SortAggregateExec [WIP] by andygrove · Pull Request #4565 · apache/datafusion-comet

andygrove · 2026-06-02T14:35:33Z

Which issue does this PR close?

N/A — proactive coverage for an aggregate operator Spark sometimes plans that Comet was previously falling back on.

Rationale for this change

Spark's planner picks SortAggregateExec when neither HashAggregateExec nor ObjectHashAggregateExec fits — typically when spark.sql.execution.useObjectHashAggregateExec=false is set, or for TypedImperativeAggregate functions whose buffer state can't ride hash aggregation. Today Comet doesn't recognize the operator at all, so any such Partial→Final pair runs on Spark and tends to drag the surrounding shuffle off Comet too. CollectSet under useObjectHashAggregate=false is the motivating example.

What changes are included in this PR?

Map SortAggregateExec to a new CometSortAggregateExec serde object that reuses the existing CometBaseAggregate.doConvert path. Same Comet-shuffle gate as CometObjectHashAggregateExec (their TypedImperativeAggregate buffer formats differ between Spark and Comet, so a Comet-Partial / Spark-Final split would crash).
createExec produces a CometHashAggregateExec wrapper — the same approach CometObjectHashAggregateExec already uses. CometExec.outputOrdering defaults to originalPlan.outputOrdering, so SortAggregateExec's grouping-key ordering flows through unchanged for any downstream operator that elided a sort against it.
No proto / Rust changes: DataFusion's AggregateExec::try_new auto-detects InputOrderMode::Sorted from the child's output ordering, so sorted input naturally produces sorted output and the streaming-aggregate optimization kicks in for free.
Cleanups landed in passing: lifted getSupportLevel (was duplicated 3x) and adjustOutputForNativeState (was duplicated verbatim 2x) onto the CometBaseAggregate trait, and collapsed three Spark-side aggregate enumeration sites (isAggregate, findCometPartialAgg, the shuffle guard in canAggregateBeConverted) to match BaseAggregateExec once instead of listing each subclass.

What's NOT included

Arbitrary user-defined TypedImperativeAggregate UDAFs still fall back — their update/merge/serialize methods are JVM code that native execution can't invoke. This PR only enables SortAggregateExec for aggregate functions Comet already implements natively (currently CollectSet and BloomFilterAggregate among the TypedImperative set; future natives like CollectList will benefit automatically).

How are these changes tested?

Added CometAggregateSuite test "SortAggregate with collect_set is converted to native": forces useObjectHashAggregateExec=false, asserts Spark plans SortAggregateExec for the query, then runs Comet and asserts the entire plan executes natively and matches Spark. Verified locally with the full CometAggregateSuite (80 tests pass) and CometExecRuleSuite.

Wire SortAggregateExec through the existing CometBaseAggregate.doConvert path so that queries planned with useObjectHashAggregate=false (or with TypedImperativeAggregate functions whose buffer formats prevent hash aggregation) can run their Partial->Final pair natively instead of falling back to Spark. CollectSet was the motivating function; the same wiring covers any other natively-implemented TypedImperativeAggregate. No proto changes: DataFusion's AggregateExec::try_new auto-detects InputOrderMode::Sorted from the child's output ordering, so a sorted input naturally produces sorted output. Wrapper reuse: createExec produces a CometHashAggregateExec (matching CometObjectHashAggregateExec). CometExec.outputOrdering already defaults to originalPlan.outputOrdering, so SortAggregateExec.outputOrdering flows through unchanged for downstream operators that elided a sort against it. Cleanups landed in passing: - Lifted baseAggregateSupportLevel and adjustOutputForNativeState onto the CometBaseAggregate trait (was duplicated 3x and 2x). - Collapsed isAggregate / findCometPartialAgg's Spark-side branches / canAggregateBeConverted's shuffle guard to BaseAggregateExec.

andygrove changed the title ~~feat: support SortAggregateExec~~ feat: support SortAggregateExec [WIP] Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support SortAggregateExec [WIP]#4565

feat: support SortAggregateExec [WIP]#4565
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:feat/sort-aggregate

andygrove commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andygrove commented Jun 2, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

What's NOT included

How are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant