Track segment replication state in SegmentPartitionMetadataManager to avoid partition routing failures#18744
Conversation
6ce713d to
66aa4b7
Compare
… avoid partition routing failures Fixes spurious "Failed to find enabled fully replicated server for table: X, partition: N" MSE failures on partitioned tables. 1. Track per-segment replication state against the latest IdealState assignment. A segment that has never been fully replicated per its current target assignment is excluded from the partition info if it would reduce the fully replicated servers. Previously this protection was keyed off the segment creation/push time only: segments taking longer than the window to replicate (e.g. large upsert segments still downloading/preloading), and segments uploaded under a new name to replace other segments, would start reducing the fully replicated servers the moment the window lapsed, potentially down to the empty set, failing every MSE query on the partition. The pending state is now keyed off the time the segment's ideal assignment last changed (so it also covers segment moves), and is still bounded by pinot.broker.new.segment.expiration.seconds so that a replica permanently failing to load cannot silently exclude a segment from being served forever. Once a segment has been fully replicated, losing a replica reduces the fully replicated servers as before, so queries are routed to the remaining replicas. 2. A segment with no available replica no longer clears the partition's fully replicated servers. No server can serve such a segment regardless of routing, so clearing the set turned one bad segment (all replicas in ERROR state, or a deletion race) into a full partition outage. It is now excluded from the partition info, reported via TablePartitionReplicatedServersInfo.getUnavailableSegments(), and attached to the query response as a SERVER_SEGMENT_MISSING exception through DispatchablePlanMetadata (same mechanism as the non-partitioned leaf stage, honoring the ignoreMissingSegments query option). 3. Add missing pinot-java-client test dependency to pinot-broker. Since apache#17167 made ControllerTest reference PinotAdminException (test scoped in pinot-controller, thus not transitive), every ControllerTest derived test class in pinot-broker failed to load and silently ran 0 tests (SegmentPartitionMetadataManagerTest, TimeBoundaryManagerTest, SegmentPrunerTest, SegmentZkMetadataFetcherTest, BrokerRoutingManagerConcurrencyTest, RemoteClusterBrokerRoutingManagerTest). With the dependency restored, all of them run and pass again. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
66aa4b7 to
d51e13f
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18744 +/- ##
============================================
+ Coverage 64.51% 65.49% +0.98%
- Complexity 1291 1309 +18
============================================
Files 3372 3381 +9
Lines 208638 209857 +1219
Branches 32596 32816 +220
============================================
+ Hits 134604 137449 +2845
+ Misses 63239 61356 -1883
- Partials 10795 11052 +257
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR hardens partitioned-table routing for multi-stage engine (MSE) queries by making SegmentPartitionMetadataManager aware of per-segment replication progress (vs the current IdealState assignment) and by surfacing segments with no available replicas to the query response (instead of turning them into full partition routing failures). It also fixes pinot-broker test execution by restoring a missing test-scoped dependency.
Changes:
- Track per-segment replication state against IdealState in
SegmentPartitionMetadataManagerto avoid “cliff” failures when segments aren’t yet fully replicated or are replaced/moved. - Propagate “unavailable segments” (no available replica) into
DispatchablePlanMetadataso the broker can reportSERVER_SEGMENT_MISSINGconsistently for partitioned routing. - Restore
pinot-java-clientas a test dependency forpinot-brokerand update impacted tests/utilities for the newTablePartitionReplicatedServersInfoshape.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-query-planner/src/test/java/org/apache/pinot/query/QueryEnvironmentTestBase.java | Updates test scaffolding for the expanded TablePartitionReplicatedServersInfo constructor. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java | Attaches unavailable segments into dispatch metadata so they can be surfaced in query responses. |
| pinot-core/src/main/java/org/apache/pinot/core/routing/TablePartitionReplicatedServersInfo.java | Adds an unavailableSegments field + getter to carry “no available replica” information. |
| pinot-broker/src/test/java/org/apache/pinot/broker/routing/segmentpartition/SegmentPartitionMetadataManagerTest.java | Adds new tests for replication-state tracking and pending-expiration behavior. |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpartition/SegmentPartitionMetadataManager.java | Implements replication-state tracking keyed to IdealState changes; excludes pending segments; records unavailable segments. |
| pinot-broker/pom.xml | Restores pinot-java-client test dependency so broker tests execute instead of silently running 0 tests. |
| * 1. For each partition ID, what are the servers that contains ALL segments belong to this partition ID. | ||
| * 2. For each server, what are all the partition IDs and list of segments of those partition IDs on this server. | ||
| */ | ||
| /// The `PartitionDataManager` manages partitions of a table. It manages |
| /// forever. Segments without any available replica are also excluded because they cannot be served regardless of the | ||
| /// server picked, and are reported via [TablePartitionReplicatedServersInfo#getUnavailableSegments()]. |
| // Attach unavailable segments (no available replica, excluded from the partition info) to the metadata so that | ||
| // they are reported in the query response |
| /// Returns the segments excluded from the partition info because they have no available replica, thus cannot be | ||
| /// served regardless of the server picked. |
|
|
||
| // After the pending state expires, segmentY should be treated as a regular segment and reduce the fully | ||
| // replicated servers, so that its data is still served from the available replica | ||
| Thread.sleep(10); |
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal routing correctness issue; see inline comment.
| if (isServingState(entry.getValue())) { | ||
| String server = entry.getKey(); | ||
| numTargetServers++; | ||
| targetServersHash += server.hashCode(); |
There was a problem hiding this comment.
This target-set fingerprint is not unique enough for assignment changes. For example, {Server_localhost_8000, Server_localhost_8003} and {Server_localhost_8001, Server_localhost_8002} have the same count and summed String.hashCode(), so a rebalance between them will not reset _fullyReplicatedOnce. That lets a moved segment reduce the fully replicated server set immediately and can reintroduce the partition routing failures this PR is fixing. Please track the actual target server set, or use a collision-resistant fingerprint, before treating the assignment as unchanged.
Problem
MSE queries on partitioned tables can fail with:
SegmentPartitionMetadataManagercomputes the per-partition fully replicated servers by intersecting the ExternalView ONLINE/CONSUMING server sets across all non-"new" segments. Two issues make this fragile:pinot.broker.new.segment.expiration.seconds, made configurable in Make new segment expiration time configurable in SegmentPartitionMetadataManager #18707). Segments that take longer than the window to become fully replicated (e.g. large upsert segments still downloading/preloading), and segments uploaded under a new name to replace other segments (e.g. via merge/rollup or compaction style tasks using the segment replacement protocol), start reducing the fully replicated servers the moment the window lapses — potentially down to the empty set, which fails every MSE query on the partition. Raising the config only moves the cliff.Fix
pinot.broker.new.segment.expiration.secondsso a replica permanently failing to load cannot silently exclude a segment from being served forever. Once fully replicated, losing a replica reduces the fully replicated servers as before, so queries are routed to the remaining replicas. When target servers cannot be determined from the ideal state, the previous creation-time based check is used as fallback.SERVER_SEGMENT_MISSINGexception viaDispatchablePlanMetadata#addUnavailableSegments— the same mechanism the non-partitioned leaf stage uses, honoring theignoreMissingSegmentsquery option.pinot-java-clienttest dependency topinot-broker. Since Add typed admin client APIs with response classes in pinot-common; migrate tests and connectors to PinotAdminClient #17167 madeControllerTestreferencePinotAdminException(test scoped inpinot-controller, thus not transitive), everyControllerTest-derived test class inpinot-brokerfailed to load and silently ran 0 tests in CI (SegmentPartitionMetadataManagerTest,TimeBoundaryManagerTest,SegmentPrunerTest,SegmentZkMetadataFetcherTest,BrokerRoutingManagerConcurrencyTest,RemoteClusterBrokerRoutingManagerTest— TestNG swallows theNoClassDefFoundErrorand reportsTests run: 0). With the dependency restored, all of them run and pass again.Testing
testReplicationStateTracking: never-replicated old segment is excluded instead of reducing the fully replicated servers; replica loss after full replication reduces the set; full replica loss excludes + reports the segment without clearing the set; recovery re-includes it; ideal assignment change resets the tracking.testPendingSegmentExpiration: after the pending window lapses, the segment falls back to reducing the fully replicated servers so its data is still served from the available replica.pinot-broker(316 tests) andpinot-query-planner(1263 tests) suites pass.🤖 Generated with Claude Code