test: Add local worker execution and statistics testing infrastructure#6697
Draft
Conversation
Add tests that verify the full interaction between distributed plan node task production, local execution stats emission, and centralized aggregation through the StatisticsManager.

Key changes:
- Add `LocalSwordfishWorker`: a Worker implementation that executes SwordfishTasks in-process via the local execution pipeline, producing real ExecutionStats with genuine StatSnapshot data.
- Add an `execute_local_plan()` public API to daft-local-execution for running a LocalPhysicalPlan and getting results + stats.
- Fix `get_context()` for non-Python builds to return a default DaftContext instead of panicking.
- Add `with_execution_stats()` to MockTaskBuilder for richer mock tasks.
- Make `PlanExecutionContext::new()` pub(crate) for test access.

Tests cover:
- Sort single-partition stats: verifies phase-based filtering where only FINAL_SORT_PHASE Default snapshots contribute row counts.
- Filter stats with selectivity: verifies FilterSnapshot aggregation with correct rows_in/out and selectivity percentage.
- Filter multi-partition: verifies correct aggregation across multiple tasks (partitions).

All tests use produce_tasks() on real DistributedPipelineNode trees, ensuring the local plans come from actual pipeline node code rather than being hand-constructed.

https://claude.ai/code/session_01FSdbkeidSkPZNWT3supLK9
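The phase-based filtering the sort test checks can be sketched in plain Rust. This is a hypothetical simplification: `StatSnapshot`, `aggregate_sort_stats`, and the phase label below are illustrative stand-ins for the real daft-distributed types, which this sketch does not use.

```rust
// Illustrative stand-in for the phase label the PR description mentions.
const FINAL_SORT_PHASE: &str = "final-sort";

// Illustrative stand-in for a per-phase stats snapshot.
struct StatSnapshot {
    phase: &'static str,
    rows: u64,
    duration_ms: u64,
}

/// Aggregate snapshots the way the sort stats test expects: only
/// snapshots from the final-sort phase contribute row counts (so rows
/// already seen in the sample/repartition phases are not double-counted),
/// while duration is summed across every phase.
fn aggregate_sort_stats(snapshots: &[StatSnapshot]) -> (u64, u64) {
    let rows: u64 = snapshots
        .iter()
        .filter(|s| s.phase == FINAL_SORT_PHASE)
        .map(|s| s.rows)
        .sum();
    let duration_ms: u64 = snapshots.iter().map(|s| s.duration_ms).sum();
    (rows, duration_ms)
}
```

With three snapshots of 9 rows each (one per phase), the aggregate reports 9 rows rather than 27, while the duration covers all three phases.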
Rust Dependency Diff (Head): ✅ OK: within budget.
Codecov Report ❌

@@ Coverage Diff @@
## main #6697 +/- ##
==========================================
+ Coverage 75.02% 75.62% +0.59%
==========================================
Files 1067 1081 +14
Lines 147155 149729 +2574
==========================================
+ Hits 110408 113227 +2819
+ Misses 36747 36502 -245
Extend the distributed stats integration tests with a multi-partition sort test that exercises all three sort phases (sample, repartition, final-sort) through real execution, verifying that SortStats correctly filters stats by phase to avoid double-counting.

To enable this without Python/Ray:
- Implement a pure-Rust get_partition_boundaries_from_samples using MicroPartition::sort() + quantiles() (non-Python cfg)
- Fix the RepartitionSink non-Python finalize path to produce partitions with data instead of panicking
- Add a ShufflePartitionMetadata.data field for non-Python mode to carry actual partition data (replacing ray.put() object refs)
- Update execute_local_plan to return a LocalPlanOutput enum handling both regular partitions and shuffle output
- Update LocalSwordfishWorker to convert shuffle output into TaskOutput::ShuffleWrite with proper ShufflePartitionRef::Ray entries

The test creates a 3-partition sort pipeline, calls produce_tasks() which drives the sort node's real execution_loop through all phases, and verifies that the aggregated stats show rows only from the final-sort phase while the duration spans all phases.

https://claude.ai/code/session_01FSdbkeidSkPZNWT3supLK9
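The sample-based boundary computation can be illustrated on plain integers. The real implementation operates on MicroPartition via sort() + quantiles(); the `partition_boundaries` helper below is a hypothetical simplification of that idea, not the PR's code.

```rust
/// Derive range-partition boundaries from sampled sort keys: sort the
/// sample, then take values at evenly spaced quantiles. N partitions
/// need N - 1 boundaries. (Illustrative sketch on i64 keys; the real
/// version works on MicroPartition columns.)
fn partition_boundaries(mut samples: Vec<i64>, num_partitions: usize) -> Vec<i64> {
    assert!(!samples.is_empty() && num_partitions > 0);
    samples.sort_unstable();
    (1..num_partitions)
        .map(|i| {
            // Index of the i-th quantile in the sorted sample.
            let idx = i * samples.len() / num_partitions;
            samples[idx.min(samples.len() - 1)]
        })
        .collect()
}
```

For a 3-partition sort over samples 1..=9, this yields two boundaries splitting the key range into three roughly equal ranges.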
…exports

Revert the filter, in_memory_source, and sort modules to private visibility. Instead, add #[cfg(test)] re-exports of FilterNode, InMemorySourceNode, and SortNode from the pipeline_node module so test code can access them without widening the production API surface. Also remove unused imports from the test file.

https://claude.ai/code/session_01FSdbkeidSkPZNWT3supLK9
Add comments to all #[cfg(not(feature = "python"))] paths that were added or modified, clarifying that they are only exercised by Rust-only test runs (cargo test); production always enables the python feature.

Affected paths:
- daft-context get_context() non-Python impl
- sort get_partition_boundaries_from_samples non-Python impl
- RepartitionSink non-Python finalize path
- ShufflePartitionMetadata.data field
- execute_local_plan / LocalPlanOutput

https://claude.ai/code/session_01FSdbkeidSkPZNWT3supLK9
Changes Made
This PR adds comprehensive testing infrastructure for the distributed execution pipeline, focusing on statistics collection and aggregation:
New Components
LocalSwordfishWorker (src/daft-distributed/src/scheduling/local_worker.rs)
- Implements the Worker and WorkerManager traits for local execution, producing real ExecutionStats

Statistics Tests (src/daft-distributed/src/statistics/tests.rs)
- test_sort_single_partition_stats: validates that the Sort node correctly reports rows and duration
- test_filter_stats_aggregation: verifies the Filter node selectivity calculation (60% for x > 2 on [1,2,3,4,5])
- test_filter_stats_multi_partition: tests stats aggregation across multiple partitions (3 partitions with 9 total rows, 3 passing the filter)

Local Plan Execution API (src/daft-local-execution/src/run.rs)
- execute_local_plan() function for running plans in-process, reusing the NativeExecutor machinery

Modifications
- src/daft-context/src/lib.rs: implement get_context() for non-Python builds to support testing
- src/daft-distributed/src/pipeline_node/mod.rs: export the filter and in_memory_source modules for test access
- src/daft-distributed/src/scheduling/scheduler/scheduler_actor.rs: add an into_task() method for test extraction
- src/daft-distributed/src/plan/runner.rs: make PlanExecutionContext::new() public for testing
- src/daft-distributed/src/scheduling/mod.rs: conditionally export the local_worker module for tests
- src/daft-distributed/src/statistics/mod.rs: add the test module
- src/daft-local-execution/src/lib.rs: export the execute_local_plan function
- src/daft-distributed/Cargo.toml: add dev dependencies (arrow, daft-core, daft-local-execution)

Test Coverage
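The selectivity math the filter tests check can be sketched without the real FilterSnapshot type. The struct and function below are illustrative stand-ins, assuming selectivity is reported as rows_out / rows_in as a percentage after summing across tasks:

```rust
// Illustrative stand-in for a per-task filter stats snapshot.
struct FilterSnapshot {
    rows_in: u64,
    rows_out: u64,
}

/// Aggregate filter snapshots across tasks (partitions): sum rows in
/// and out, then compute selectivity as a percentage of input rows.
fn aggregate_filter(snaps: &[FilterSnapshot]) -> (u64, u64, f64) {
    let rows_in: u64 = snaps.iter().map(|s| s.rows_in).sum();
    let rows_out: u64 = snaps.iter().map(|s| s.rows_out).sum();
    let selectivity = if rows_in == 0 {
        0.0
    } else {
        100.0 * rows_out as f64 / rows_in as f64
    };
    (rows_in, rows_out, selectivity)
}
```

On the single-partition case from the test description (x > 2 over [1,2,3,4,5]) this gives 60% selectivity; on the multi-partition case (3 partitions, 9 rows in, 3 out) the sums aggregate before the percentage is computed.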
The new tests validate the stats flows listed above. All tests use the local worker infrastructure to execute real local plans and collect genuine execution statistics.
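The "real stats from in-process execution" idea can be sketched with a toy worker. This is in the spirit of LocalSwordfishWorker but uses none of its API: `LocalWorker`, `ExecStats`, and `submit` are hypothetical names, and tasks are plain closures rather than SwordfishTasks.

```rust
use std::time::Instant;

// Illustrative stand-in for per-task execution stats.
struct ExecStats {
    rows: u64,
    duration_ms: u128,
}

/// Toy in-process worker: runs each task on the current thread and
/// records genuine (not mocked) row counts and timings, which a stats
/// manager could later aggregate.
struct LocalWorker {
    stats: Vec<ExecStats>,
}

impl LocalWorker {
    fn new() -> Self {
        Self { stats: Vec::new() }
    }

    /// Execute a task that yields rows, capturing real execution stats.
    fn submit<F: FnOnce() -> Vec<i64>>(&mut self, task: F) -> Vec<i64> {
        let start = Instant::now();
        let out = task();
        self.stats.push(ExecStats {
            rows: out.len() as u64,
            duration_ms: start.elapsed().as_millis(),
        });
        out
    }

    fn total_rows(&self) -> u64 {
        self.stats.iter().map(|s| s.rows).sum()
    }

    fn total_duration_ms(&self) -> u128 {
        self.stats.iter().map(|s| s.duration_ms).sum()
    }
}
```

Submitting one task per partition and then reading the totals mirrors how the tests drive real local plans and inspect the aggregated statistics afterward.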
https://claude.ai/code/session_01FSdbkeidSkPZNWT3supLK9