Make CUDA/AOTI partitioner composable after another delegate by shoumikhin · Pull Request #20077 · pytorch/executorch

shoumikhin · 2026-06-05T20:26:20Z

Summary:
`AotiPartitioner.partition` tagged every `call_function` node, including `executorch_call_delegate` calls already lowered by an earlier partitioner. As a result, when `CudaPartitioner` ran as a second partitioner (e.g. after a TensorRT partition in a stacked `.pte`, where TensorRT lowers the ops it can and the CUDA backend handles the rest), it tried to re-delegate the foreign delegate node and produced a malformed nested delegate. This blocked composing the two backends in one `.pte`.

Tag only the non-lowered nodes, reusing the existing `get_non_lowered_nodes` helper (which already excludes `executorch_call_delegate` calls and their output `getitem` nodes), so the partitioner claims just the remaining ops and composes cleanly after another backend. In the single-partitioner case there are no delegate nodes, so `get_non_lowered_nodes` returns every `call_function` node and behavior is unchanged.
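For illustration only, the filtering rule described above can be sketched on a toy graph. Everything here is a hypothetical stand-in: `ToyNode`, `is_lowered`, `non_lowered_nodes`, and `tag_remaining_ops` are not the ExecuTorch or `torch.fx` API, and the `"delegation_tag"` metadata key and the tag strings are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node (not torch.fx.Node)."""
    name: str
    op: str  # "placeholder", "call_function", ...
    target: str = ""
    args: List["ToyNode"] = field(default_factory=list)
    meta: Dict[str, str] = field(default_factory=dict)


def is_lowered(node: ToyNode) -> bool:
    """True for a delegate call, or a getitem that unpacks a delegate call."""
    if node.op != "call_function":
        return False
    if node.target == "executorch_call_delegate":
        return True
    return (
        node.target == "getitem"
        and bool(node.args)
        and node.args[0].target == "executorch_call_delegate"
    )


def non_lowered_nodes(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy analogue of the helper: call_function nodes not yet lowered."""
    return [n for n in nodes if n.op == "call_function" and not is_lowered(n)]


def tag_remaining_ops(nodes: List[ToyNode], tag: str) -> None:
    """Tag only the ops that no earlier backend has claimed."""
    for node in non_lowered_nodes(nodes):
        node.meta["delegation_tag"] = tag  # assumed metadata key


# A graph in which an earlier backend already lowered part of the model.
x = ToyNode("x", "placeholder")
delegate = ToyNode(
    "lowered_0", "call_function", "executorch_call_delegate",
    [x], {"delegation_tag": "other_backend_0"},
)
out0 = ToyNode("getitem_0", "call_function", "getitem", [delegate])
add = ToyNode("add", "call_function", "aten.add", [out0, out0])
graph = [x, delegate, out0, add]

tag_remaining_ops(graph, "cuda_0")
print({n.name: n.meta.get("delegation_tag") for n in graph})
```

In this sketch the delegate call keeps its earlier tag, its output `getitem` stays untagged, and only the remaining `aten.add` node receives the new tag. With no delegate node in the list, `non_lowered_nodes` returns every `call_function` node, matching the unchanged single-partitioner behavior described above.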

The same composition gap existed for constants: the final loop tagged every untagged param, buffer, and lifted constant with this partition's tag, including ones consumed only by the foreign delegate. Backend lowering rejected those because it requires every user of a tagged constant to share that tag, whereas the foreign delegate's call keeps its prior tag. Now only genuinely unused constants are tagged here: `tag_constant_data` already claims the ones this partition uses, and a constant feeding only a prior delegate is left untagged. Mirrored in fbcode and xplat.
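The constant-tagging rule can be sketched the same way. This is a toy model under stated assumptions, not the real implementation: `ToyOp`, `ToyConst`, `claim_used_constants` (a stand-in for the role the summary attributes to `tag_constant_data`), `tag_all_untagged`, `tag_unused_constants`, and `users_share_tag` are all hypothetical names, as are the tag strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TAG = "delegation_tag"  # assumed metadata key for this sketch


@dataclass
class ToyOp:
    """Hypothetical op node that consumes constants."""
    name: str
    meta: Dict[str, str] = field(default_factory=dict)


@dataclass
class ToyConst:
    """Hypothetical param/buffer/lifted constant with its users."""
    name: str
    users: List[ToyOp] = field(default_factory=list)
    meta: Dict[str, str] = field(default_factory=dict)


def users_share_tag(const: ToyConst) -> bool:
    """Invariant from the summary: every user of a tagged constant shares its tag."""
    tag = const.meta.get(TAG)
    return tag is None or all(u.meta.get(TAG) == tag for u in const.users)


def claim_used_constants(consts: List[ToyConst], tag: str) -> None:
    """Stand-in: claim constants whose users all belong to this partition."""
    for c in consts:
        if c.users and all(u.meta.get(TAG) == tag for u in c.users):
            c.meta[TAG] = tag


def tag_all_untagged(consts: List[ToyConst], tag: str) -> None:
    """Old rule: every still-untagged constant gets this partition's tag."""
    for c in consts:
        c.meta.setdefault(TAG, tag)


def tag_unused_constants(consts: List[ToyConst], tag: str) -> None:
    """New rule: only untagged constants with no users at all are tagged."""
    for c in consts:
        if TAG not in c.meta and not c.users:
            c.meta[TAG] = tag


def build() -> List[ToyConst]:
    foreign = ToyOp("lowered_0", {TAG: "other_backend_0"})
    add = ToyOp("add", {TAG: "cuda_0"})
    return [
        ToyConst("w_foreign", [foreign]),  # feeds only the prior delegate
        ToyConst("w_cuda", [add]),         # used by this partition
        ToyConst("w_dead"),                # genuinely unused
    ]


old = build()
claim_used_constants(old, "cuda_0")
tag_all_untagged(old, "cuda_0")
old_ok = all(users_share_tag(c) for c in old)

new = build()
claim_used_constants(new, "cuda_0")
tag_unused_constants(new, "cuda_0")
new_ok = all(users_share_tag(c) for c in new)

print(old_ok, new_ok)
```

Under the old rule the constant that feeds only the prior delegate picks up this partition's tag and breaks the shared-tag invariant; under the new rule it stays untagged, while the used and the unused constants are still claimed.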

Differential Revision: D107690797

pytorch-bot · 2026-06-05T20:26:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20077

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3d9d130 with merge base 0d904b6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-05T20:26:30Z

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107690797.

Copilot

Pull request overview

This PR fixes composability of the CUDA (AOTInductor/AOTI) partitioner when it runs after another backend has already lowered part of the graph into `executorch_call_delegate` calls. The CUDA partitioner no longer re-tags those already-delegated nodes, which previously produced malformed nested delegates.

Changes:

- Update `AotiPartitioner.partition()` to tag only "non-lowered" `call_function` nodes by using `get_non_lowered_nodes()`, skipping existing `executorch_call_delegate` calls and their output `getitem` nodes.
- Add a CUDA partitioner unit test that splices a fake already-lowered delegate into an exported graph and verifies that CUDA does not re-tag it while still tagging the remaining ATen ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| backends/aoti/aoti_partitioner.py | Uses `get_non_lowered_nodes()` to avoid tagging already-lowered delegate nodes, enabling stacked backend composition. |
| backends/cuda/tests/test_cuda_partitioner.py | Adds regression coverage ensuring CUDA partitioning does not re-tag pre-existing `executorch_call_delegate` nodes. |


shoumikhin · 2026-06-05T21:13:49Z

@pytorchbot label "release notes: none"


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 5, 2026 20:26

meta-cla Bot added the CLA Signed label Jun 5, 2026

Copilot started reviewing on behalf of shoumikhin June 5, 2026 20:26

meta-codesync Bot added the meta-exported label Jun 5, 2026

Copilot AI reviewed Jun 5, 2026


pytorch-bot Bot added the release notes: none label Jun 5, 2026

meta-codesync Bot changed the title ~~Make CUDA/AOTI partitioner composable after another delegate~~ Make CUDA/AOTI partitioner composable after another delegate (#20077) Jun 6, 2026

shoumikhin force-pushed the export-D107690797 branch from fe3db07 to e3a69f3 June 6, 2026 06:58

shoumikhin force-pushed the export-D107690797 branch from e3a69f3 to cf5503f June 6, 2026 06:59

shoumikhin force-pushed the export-D107690797 branch from cf5503f to bf54e2f June 6, 2026 07:35

meta-codesync Bot changed the title ~~Make CUDA/AOTI partitioner composable after another delegate (#20077)~~ Make CUDA/AOTI partitioner composable after another delegate Jun 6, 2026

shoumikhin force-pushed the export-D107690797 branch from bf54e2f to 3d9d130 June 6, 2026 17:04

Copilot AI review requested due to automatic review settings June 6, 2026 17:04

Copilot started reviewing on behalf of shoumikhin June 6, 2026 17:04

Copilot AI reviewed Jun 6, 2026


shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:export-D107690797
2 participants