
Make CUDA/AOTI partitioner composable after another delegate #20077

Open
shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:export-D107690797


@shoumikhin
Contributor

@shoumikhin shoumikhin commented Jun 5, 2026

Summary:
`AotiPartitioner.partition` tagged every `call_function` node, including `executorch_call_delegate` calls already lowered by an earlier partitioner. So when `CudaPartitioner` ran as a second partitioner (e.g. after a TensorRT partition in a stacked `.pte`, where TensorRT lowers the ops it can and the CUDA backend handles the rest), it tried to re-delegate the foreign delegate node, producing a malformed nested delegate. This is the blocker to composing the two backends in one `.pte`.

Tag only the non-lowered nodes, reusing the existing `get_non_lowered_nodes` helper (which already excludes `executorch_call_delegate` calls and their output `getitem` nodes), so the partitioner claims just the remaining ops and composes cleanly after another backend. In the single-partitioner case there are no delegate nodes, so `get_non_lowered_nodes` returns every `call_function` and behavior is unchanged.
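
The filtering rule described above can be sketched on a toy graph. This is a minimal illustration, not the ExecuTorch implementation: `Node`, `non_lowered_nodes`, and `tag_partition` are hypothetical stand-ins, targets are plain strings rather than real op objects, and the `delegation_tag` key is assumed to be where the partition tag lives.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Stand-in string for the delegate call target (assumption: the real target is an op object).
DELEGATE_TARGET = "executorch_call_delegate"


@dataclass
class Node:
    """Toy stand-in for a graph node; only the fields this sketch needs."""
    name: str
    op: str
    target: str = ""
    args: Tuple["Node", ...] = ()
    meta: Dict[str, str] = field(default_factory=dict)


def is_delegate_call(node: Node) -> bool:
    return node.op == "call_function" and node.target == DELEGATE_TARGET


def non_lowered_nodes(nodes: List[Node]) -> List[Node]:
    """Return call_function nodes that are neither delegate calls nor getitems on one."""
    remaining = []
    for node in nodes:
        if node.op != "call_function" or is_delegate_call(node):
            continue
        if node.target == "getitem" and node.args and is_delegate_call(node.args[0]):
            continue
        remaining.append(node)
    return remaining


def tag_partition(nodes: List[Node], tag: str) -> None:
    """Tag only the nodes that no earlier backend has lowered."""
    for node in non_lowered_nodes(nodes):
        node.meta["delegation_tag"] = tag


# A graph where an earlier backend already lowered part of the model.
x = Node("x", "placeholder")
trt = Node("trt", "call_function", DELEGATE_TARGET, (x,), {"delegation_tag": "trt_0"})
trt_out = Node("trt_out", "call_function", "getitem", (trt,))
relu = Node("relu", "call_function", "aten.relu", (trt_out,))
graph = [x, trt, trt_out, relu]

tag_partition(graph, "cuda_0")
tagged = {n.name: n.meta.get("delegation_tag") for n in graph}
print(tagged)
```

In this sketch only `relu` receives the new tag; the delegate call keeps its earlier tag and its `getitem` output stays untagged. With no delegate nodes in the graph, the filter passes every `call_function` node through, which matches the unchanged single-partitioner behavior described above.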

The same composition gap existed for constants: the final loop tagged every untagged param/buffer/lifted constant with this partition's tag, including ones consumed only by the foreign delegate. Backend lowering rejected those, since it requires every user of a tagged constant to share that tag, while the foreign delegate's call keeps the prior one. Now only genuinely unused constants are tagged here: `tag_constant_data` already claims the ones this partition uses, and a constant feeding only a prior delegate is left untagged. Mirrored in fbcode and xplat.
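
The constant-tagging rule can be illustrated the same way. The sketch below is a simplified model under stated assumptions: constants and their users are plain names in dictionaries, and `tag_unused_constants` and `constants_are_consistent` are hypothetical helpers that mimic the rule and the lowering check described above, not ExecuTorch functions.

```python
from typing import Dict, List, Optional


def tag_unused_constants(
    constant_users: Dict[str, List[str]],
    tags: Dict[str, Optional[str]],
    tag: str,
) -> None:
    """Tag only constants that have no tag yet and no users at all."""
    for const, users in constant_users.items():
        if tags.get(const) is None and not users:
            tags[const] = tag


def constants_are_consistent(
    constant_users: Dict[str, List[str]],
    tags: Dict[str, Optional[str]],
) -> bool:
    """Model of the lowering check: every user of a tagged constant shares its tag."""
    return all(
        tags.get(user) == tags[const]
        for const, users in constant_users.items()
        if tags.get(const) is not None
        for user in users
    )


# w_trt feeds only the earlier delegate call, w_cuda was already claimed
# by this partition, and w_dead has no users.
constant_users = {"w_trt": ["trt_call"], "w_cuda": ["relu"], "w_dead": []}
initial_tags = {"trt_call": "trt_0", "relu": "cuda_0", "w_cuda": "cuda_0"}

# Previous behavior: give every untagged constant this partition's tag.
old_tags = dict(initial_tags)
for const in constant_users:
    old_tags.setdefault(const, "cuda_0")

# New behavior: tag only the genuinely unused constants.
tags = dict(initial_tags)
tag_unused_constants(constant_users, tags, "cuda_0")

print(constants_are_consistent(constant_users, old_tags))
print(constants_are_consistent(constant_users, tags))
```

Under the old rule `w_trt` ends up with the CUDA tag while its only user keeps the earlier tag, so the consistency check fails. Under the new rule `w_trt` is left untagged and the check passes.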

Differential Revision: D107690797

Copilot AI review requested due to automatic review settings June 5, 2026 20:26
@pytorch-bot

pytorch-bot Bot commented Jun 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20077

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3d9d130 with merge base 0d904b6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 5, 2026
@meta-codesync
Contributor

meta-codesync Bot commented Jun 5, 2026

@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107690797.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes composability of the CUDA (AOTInductor/AOTI) partitioner when it runs after another backend has already lowered part of the graph into executorch_call_delegate calls, preventing CUDA from re-tagging those already-delegated nodes and producing malformed nested delegates.

Changes:

  • Update AotiPartitioner.partition() to tag only “non-lowered” call_function nodes by using get_non_lowered_nodes(), skipping existing executorch_call_delegate calls and their output getitems.
  • Add a CUDA partitioner unit test that splices a fake already-lowered delegate into an exported graph and verifies CUDA does not re-tag it while still tagging remaining ATen ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `backends/aoti/aoti_partitioner.py` | Uses `get_non_lowered_nodes()` to avoid tagging already-lowered delegate nodes, enabling stacked backend composition. |
| `backends/cuda/tests/test_cuda_partitioner.py` | Adds regression coverage ensuring CUDA partitioning does not re-tag pre-existing `executorch_call_delegate` nodes. |


@shoumikhin
Contributor Author

@pytorchbot label "release notes: none"

@pytorch-bot pytorch-bot Bot added the release notes: none label Jun 5, 2026
@meta-codesync meta-codesync Bot changed the title from "Make CUDA/AOTI partitioner composable after another delegate" to "Make CUDA/AOTI partitioner composable after another delegate (#20077)" Jun 6, 2026
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 6, 2026
…#20077)

Summary:

`AotiPartitioner.partition` tagged every `call_function` node, including `executorch_call_delegate` calls already lowered by an earlier partitioner. So when `CudaPartitioner` runs as a second partitioner — e.g. after a TensorRT partition in a stacked `.pte` where TensorRT lowers the ops it can and the CUDA backend handles the rest — it tried to re-delegate the foreign delegate node, producing a malformed nested delegate. This is the blocker to composing the two backends in one `.pte`.

Tag only the non-lowered nodes, reusing the existing `get_non_lowered_nodes` helper (which already excludes `executorch_call_delegate` calls and their output getitems), so the partitioner claims just the remaining ops and composes cleanly after another backend. In the single-partitioner case there are no delegate nodes, so `get_non_lowered_nodes` returns every `call_function` and behavior is unchanged. Mirrored in fbcode and xplat.

Differential Revision: D107690797
@shoumikhin shoumikhin force-pushed the export-D107690797 branch from fe3db07 to e3a69f3 June 6, 2026 06:58
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 6, 2026
@shoumikhin shoumikhin force-pushed the export-D107690797 branch from e3a69f3 to cf5503f June 6, 2026 06:59
shoumikhin added a commit to shoumikhin/executorch that referenced this pull request Jun 6, 2026
@shoumikhin shoumikhin force-pushed the export-D107690797 branch from cf5503f to bf54e2f June 6, 2026 07:35
@meta-codesync meta-codesync Bot changed the title from "Make CUDA/AOTI partitioner composable after another delegate (#20077)" to "Make CUDA/AOTI partitioner composable after another delegate" Jun 6, 2026
@shoumikhin shoumikhin force-pushed the export-D107690797 branch from bf54e2f to 3d9d130 June 6, 2026 17:04
Copilot AI review requested due to automatic review settings June 6, 2026 17:04
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


Labels

  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • meta-exported
  • release notes: none (Do not include this in the release notes)


2 participants