Make CUDA/AOTI partitioner composable after another delegate #20077
shoumikhin wants to merge 1 commit
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20077
✅ No failures as of commit 3d9d130 with merge base 0d904b6. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@shoumikhin has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107690797.
Pull request overview
This PR fixes composability of the CUDA (AOTInductor/AOTI) partitioner when it runs after another backend has already lowered part of the graph into executorch_call_delegate calls, preventing CUDA from re-tagging those already-delegated nodes and producing malformed nested delegates.
Changes:
- Update `AotiPartitioner.partition()` to tag only “non-lowered” `call_function` nodes by using `get_non_lowered_nodes()`, skipping existing `executorch_call_delegate` calls and their output `getitem`s.
- Add a CUDA partitioner unit test that splices a fake already-lowered delegate into an exported graph and verifies CUDA does not re-tag it while still tagging remaining ATen ops.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| backends/aoti/aoti_partitioner.py | Uses get_non_lowered_nodes() to avoid tagging already-lowered delegate nodes, enabling stacked backend composition. |
| backends/cuda/tests/test_cuda_partitioner.py | Adds regression coverage ensuring CUDA partitioning does not re-tag pre-existing executorch_call_delegate nodes. |
@pytorchbot label "release notes: none"
…#20077)
Summary:
`AotiPartitioner.partition` tagged every `call_function` node, including `executorch_call_delegate` calls already lowered by an earlier partitioner. So when `CudaPartitioner` ran as a second partitioner (e.g. after a TensorRT partition in a stacked `.pte` where TensorRT lowers the ops it can and the CUDA backend handles the rest), it tried to re-delegate the foreign delegate node, producing a malformed nested delegate. This is the blocker to composing the two backends in one `.pte`.
Tag only the non-lowered nodes, reusing the existing `get_non_lowered_nodes` helper (which already excludes `executorch_call_delegate` calls and their output getitems), so the partitioner claims just the remaining ops and composes cleanly after another backend. In the single-partitioner case there are no delegate nodes, so `get_non_lowered_nodes` returns every `call_function` and behavior is unchanged.
Mirrored in fbcode and xplat.
Differential Revision: D107690797
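The filtering rule in this summary can be illustrated with a small, self-contained sketch. It does not use ExecuTorch or `torch.fx`: the `Node` class, the `non_lowered_nodes` function, and the string targets below are hypothetical stand-ins that only mimic the behavior the summary attributes to `get_non_lowered_nodes`, and the `delegation_tag` key is used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

DELEGATE = "executorch_call_delegate"  # target name of an already-lowered call
GETITEM = "getitem"                    # target name of an output-unpacking node


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not the real torch.fx.Node)."""
    name: str
    op: str                           # "placeholder", "call_function" or "output"
    target: str = ""                  # function name for call_function nodes
    args: Tuple["Node", ...] = ()
    meta: Dict[str, str] = field(default_factory=dict)


def non_lowered_nodes(nodes: List[Node]) -> List[Node]:
    """Toy analogue of get_non_lowered_nodes: keep call_function nodes that are
    neither a delegate call nor a getitem reading a delegate call's output."""
    kept = []
    for node in nodes:
        if node.op != "call_function":
            continue
        if node.target == DELEGATE:
            continue
        if node.target == GETITEM and node.args and node.args[0].target == DELEGATE:
            continue
        kept.append(node)
    return kept


def tag_partition(nodes: List[Node], tag: str) -> List[str]:
    """Tag only the non-lowered nodes and return the names that were tagged."""
    for node in non_lowered_nodes(nodes):
        node.meta["delegation_tag"] = tag
    return [n.name for n in nodes if n.meta.get("delegation_tag") == tag]


# A graph in which an earlier backend already lowered the first op.
x = Node("x", "placeholder")
trt_call = Node("trt_call", "call_function", DELEGATE, (x,))
trt_out = Node("trt_out", "call_function", GETITEM, (trt_call,))
relu = Node("relu", "call_function", "aten.relu", (trt_out,))
add = Node("add", "call_function", "aten.add", (relu, x))
out = Node("out", "output", args=(add,))
graph = [x, trt_call, trt_out, relu, add, out]

tagged = tag_partition(graph, "cuda_tag")
print(tagged)  # ['relu', 'add']
```

In this toy graph the delegate call and its `getitem` stay untagged, while the two remaining ops are claimed. With no delegate node present, every `call_function` node would pass the filter, which matches the "behavior is unchanged" claim for the single-partitioner case.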
Force-pushed from fe3db07 to e3a69f3
Force-pushed from e3a69f3 to cf5503f
Force-pushed from cf5503f to bf54e2f
Summary:
`AotiPartitioner.partition` tagged every `call_function` node, including `executorch_call_delegate` calls already lowered by an earlier partitioner. So when `CudaPartitioner` ran as a second partitioner (e.g. after a TensorRT partition in a stacked `.pte` where TensorRT lowers the ops it can and the CUDA backend handles the rest), it tried to re-delegate the foreign delegate node, producing a malformed nested delegate. This is the blocker to composing the two backends in one `.pte`.
Tag only the non-lowered nodes, reusing the existing `get_non_lowered_nodes` helper (which already excludes `executorch_call_delegate` calls and their output getitems), so the partitioner claims just the remaining ops and composes cleanly after another backend. In the single-partitioner case there are no delegate nodes, so `get_non_lowered_nodes` returns every `call_function` and behavior is unchanged.
The same composition gap existed for constants: the final loop tagged every untagged param/buffer/lifted constant with this partition's tag, including ones consumed only by the foreign delegate. Backend lowering rejected those, since it requires every user of a tagged constant to share that tag, whereas the foreign delegate's call keeps its earlier tag. Now only genuinely unused constants are tagged here: `tag_constant_data` already claims the ones this partition uses, and a constant feeding only a prior delegate is left untagged.
Mirrored in fbcode and xplat.
Differential Revision: D107690797
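The constant-tagging rule can be sketched the same way. This is a simplified model under stated assumptions, not the code in `aoti_partitioner.py`: `Node`, `tag_constants`, and `lowering_accepts` are hypothetical names, and the first loop is only a rough approximation of what `tag_constant_data` is described as doing.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TAG_KEY = "delegation_tag"  # key name chosen for illustration


@dataclass
class Node:
    """Hypothetical stand-in for a graph node with explicit user links."""
    name: str
    users: List["Node"] = field(default_factory=list)
    meta: Dict[str, str] = field(default_factory=dict)


def tag_constants(constants: List[Node], tag: str) -> None:
    # Step 1 (rough analogue of tag_constant_data): claim a constant when
    # every one of its users already carries this partition's tag.
    for const in constants:
        if const.users and all(u.meta.get(TAG_KEY) == tag for u in const.users):
            const.meta[TAG_KEY] = tag
    # Step 2 (the behavior described in the summary): of the constants still
    # untagged, tag only those with no users at all.
    for const in constants:
        if TAG_KEY not in const.meta and not const.users:
            const.meta[TAG_KEY] = tag


def lowering_accepts(const: Node) -> bool:
    """The invariant from the summary: every user of a tagged constant must
    share that constant's tag. Untagged constants are not checked."""
    tag = const.meta.get(TAG_KEY)
    return tag is None or all(u.meta.get(TAG_KEY) == tag for u in const.users)


# Users: one op claimed by this partition, one delegate call from a prior backend.
cuda_op = Node("linear", meta={TAG_KEY: "cuda_tag"})
trt_call = Node("trt_call", meta={TAG_KEY: "trt_tag"})

w_cuda = Node("w_cuda", users=[cuda_op])   # used by this partition
w_trt = Node("w_trt", users=[trt_call])    # used only by the prior delegate
w_unused = Node("w_unused")                # no users

constants = [w_cuda, w_trt, w_unused]
tag_constants(constants, "cuda_tag")

result = {c.name: c.meta.get(TAG_KEY) for c in constants}
print(result)  # {'w_cuda': 'cuda_tag', 'w_trt': None, 'w_unused': 'cuda_tag'}
```

Under the earlier behavior, `w_trt` would also have received `cuda_tag`, and `lowering_accepts(w_trt)` would have returned `False` because its only user carries a different tag. Leaving it untagged keeps the invariant intact for all three constants.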
Force-pushed from bf54e2f to 3d9d130