Skip to content

refactor: add try_to_proto to HashTableLookupExpr#22451

Open
AnuragRaut08 wants to merge 5 commits into
apache:mainfrom
AnuragRaut08:feat/port-hash-table-lookup-expr-proto
Open

refactor: add try_to_proto to HashTableLookupExpr#22451
AnuragRaut08 wants to merge 5 commits into
apache:mainfrom
AnuragRaut08:feat/port-hash-table-lookup-expr-proto

Conversation

@AnuragRaut08
Copy link
Copy Markdown

Which issue does this PR close?

Part of #22435

What changes are included?

Adds try_to_proto to HashTableLookupExpr so it participates in the
expression-local serialization pattern introduced in #21929.

HashTableLookupExpr holds a runtime Arc<Map> that cannot be
serialized, so try_to_proto replaces it with lit(true). This is
safe because the filter is a performance optimisation only — lit(true)
passes all rows and the join produces correct results either way.

The centralized arm in to_proto.rs remains as a fallback for now.
Cleanup can follow in a separate PR once this lands.

Are these changes tested?

Yes — covered by the existing roundtrip_hash_table_lookup_expr_to_lit
test in datafusion/proto/tests/cases/roundtrip_physical_plan.rs.

Are there any user-facing changes?

No.

Part of apache#22435. Adds try_to_proto to HashTableLookupExpr so it
participates in the expression-local serialization pattern introduced
in apache#21929.

HashTableLookupExpr holds a runtime Arc<Map> that cannot be serialized,
so try_to_proto replaces it with lit(true). This is safe because the
filter is a performance optimisation only — lit(true) passes all rows
and the join produces correct results either way.

The centralized arm in to_proto.rs remains as a fallback for now.
@github-actions github-actions Bot added the physical-plan Changes to the physical-plan crate label May 22, 2026
@AnuragRaut08
Copy link
Copy Markdown
Author

Thanks! CI is currently running. Please let me know if any adjustments are needed — happy to iterate quickly.

/// - `lit(true)` passes all rows so no valid rows are lost.
/// - In distributed execution the remote worker has no access to the
/// build-side hash table anyway.
pub fn try_to_proto(
Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this must be moved to the impl PhysicalExpr for HashTableLookupExpr block to actually be used.

The way to test this code is being used is to delete the old paths in datafusion/datafusion/proto/src/physical_plan/to_proto.rs and datafusion/datafusion/proto/src/physical_plan/from_proto.rs.

Since this is a small PR, you may want to considering adding try_from_proto as well :)

/// - The filter is a performance optimisation, not a correctness requirement.
/// - `lit(true)` passes all rows so no valid rows are lost.
/// - In distributed execution the remote worker has no access to the
/// build-side hash table anyway.
Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment comes from roundtrip_hash_table_lookup_expr_to_lit() in /datafusion/datafusion/proto/tests/cases/roundtrip_physical_plan.rs but I don't think it's very true.

IIUC, hash joins might build a HashTableLookupExpr during execution after the build side is done. These expressions get placed in the dynamic filter expr.

If you serialize before executing the plan, then there's no code path where there would be a HashTableLookupExpr in the plan today. If you deserialize and execute that plan, then the HashJoinExec may create a fresh HashTableLookupExpr for the dynamic filter. In this case, all the row pruning is preserved.

If you serialize after executing, then any potential HashTableLookupExpr would be replaced with lit(true). I don't think this has any impact. One must call reset_state to re-execute plans, in which case I would expect the HashTableLookupExpr to disappear. In this case, all the row pruning is preserved as well.

Maybe we can change the comment to explain these two cases? ^

Since HashTableLookupExpr is public it might be good to warn users that it does not get serialized. We could (a) file a ticket to track that HashTableLookupExpr are not serialized and (b) add a comment directly on HashTableLookupExpr.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants