Skip to content

Track segment replication state in SegmentPartitionMetadataManager to avoid partition routing failures#18744

Open
xiangfu0 wants to merge 1 commit into
apache:masterfrom
xiangfu0:fix-partition-metadata-replication-tracking
Open

Track segment replication state in SegmentPartitionMetadataManager to avoid partition routing failures#18744
xiangfu0 wants to merge 1 commit into
apache:masterfrom
xiangfu0:fix-partition-metadata-replication-tracking

Conversation

@xiangfu0

Copy link
Copy Markdown
Contributor

Problem

MSE queries on partitioned tables can fail with:

Failed to find enabled fully replicated server for table: X, partition: N

SegmentPartitionMetadataManager computes the per-partition fully replicated servers by intersecting the ExternalView ONLINE/CONSUMING server sets across all non-"new" segments. Two issues make this fragile:

  1. The "new segment" protection is purely time based (pinot.broker.new.segment.expiration.seconds, made configurable in Make new segment expiration time configurable in SegmentPartitionMetadataManager #18707). Segments that take longer than the window to become fully replicated (e.g. large upsert segments still downloading/preloading), and segments uploaded under a new name to replace other segments (e.g. via merge/rollup or compaction style tasks using the segment replacement protocol), start reducing the fully replicated servers the moment the window lapses — potentially down to the empty set, which fails every MSE query on the partition. Raising the config only moves the cliff.
  2. A segment with no available replica clears the partition's fully replicated servers entirely. No server can serve such a segment regardless of which server is picked, so clearing the set turns one bad segment (all replicas in ERROR state, or a transient deletion race between EV/IS updates) into a full partition outage instead of a partial-data response.

Fix

  1. Track per-segment replication state against the latest IdealState assignment (the IdealState was already passed in but unused). A segment that has never been observed fully replicated per its current target assignment is "pending" and excluded from the partition info if it would reduce the fully replicated servers. The pending clock is keyed to when the segment's ideal assignment last changed (covering new segments, replaced segments, and rebalance moves alike), and is still bounded by pinot.broker.new.segment.expiration.seconds so a replica permanently failing to load cannot silently exclude a segment from being served forever. Once fully replicated, losing a replica reduces the fully replicated servers as before, so queries are routed to the remaining replicas. When target servers cannot be determined from the ideal state, the previous creation-time based check is used as fallback.
  2. Exclude unavailable segments instead of clearing the partition's fully replicated servers, and report them in the query response as a SERVER_SEGMENT_MISSING exception via DispatchablePlanMetadata#addUnavailableSegments — the same mechanism the non-partitioned leaf stage uses, honoring the ignoreMissingSegments query option.
  3. Add the missing pinot-java-client test dependency to pinot-broker. Since Add typed admin client APIs with response classes in pinot-common; migrate tests and connectors to PinotAdminClient #17167 made ControllerTest reference PinotAdminException (test scoped in pinot-controller, thus not transitive), every ControllerTest-derived test class in pinot-broker failed to load and silently ran 0 tests in CI (SegmentPartitionMetadataManagerTest, TimeBoundaryManagerTest, SegmentPrunerTest, SegmentZkMetadataFetcherTest, BrokerRoutingManagerConcurrencyTest, RemoteClusterBrokerRoutingManagerTest — TestNG swallows the NoClassDefFoundError and reports Tests run: 0). With the dependency restored, all of them run and pass again.

Testing

  • New testReplicationStateTracking: never-replicated old segment is excluded instead of reducing the fully replicated servers; replica loss after full replication reduces the set; full replica loss excludes + reports the segment without clearing the set; recovery re-includes it; ideal assignment change resets the tracking.
  • New testPendingSegmentExpiration: after the pending window lapses, the segment falls back to reducing the fully replicated servers so its data is still served from the available replica.
  • Full pinot-broker (316 tests) and pinot-query-planner (1263 tests) suites pass.

🤖 Generated with Claude Code

@xiangfu0 xiangfu0 added bug Something is not working as expected multi-stage Related to the multi-stage query engine labels Jun 12, 2026
@xiangfu0 xiangfu0 force-pushed the fix-partition-metadata-replication-tracking branch from 6ce713d to 66aa4b7 Compare June 12, 2026 02:31
… avoid partition routing failures

Fixes spurious "Failed to find enabled fully replicated server for table: X,
partition: N" MSE failures on partitioned tables.

1. Track per-segment replication state against the latest IdealState
   assignment. A segment that has never been fully replicated per its current
   target assignment is excluded from the partition info if it would reduce
   the fully replicated servers. Previously this protection was keyed off the
   segment creation/push time only: segments taking longer than the window to
   replicate (e.g. large upsert segments still downloading/preloading), and
   segments uploaded under a new name to replace other segments, would start
   reducing the fully replicated servers the moment the window lapsed,
   potentially down to the empty set, failing every MSE query on the
   partition. The pending state is now keyed off the time the segment's ideal
   assignment last changed (so it also covers segment moves), and is still
   bounded by pinot.broker.new.segment.expiration.seconds so that a replica
   permanently failing to load cannot silently exclude a segment from being
   served forever. Once a segment has been fully replicated, losing a replica
   reduces the fully replicated servers as before, so queries are routed to
   the remaining replicas.

2. A segment with no available replica no longer clears the partition's fully
   replicated servers. No server can serve such a segment regardless of
   routing, so clearing the set turned one bad segment (all replicas in ERROR
   state, or a deletion race) into a full partition outage. It is now excluded
   from the partition info, reported via
   TablePartitionReplicatedServersInfo.getUnavailableSegments(), and attached
   to the query response as a SERVER_SEGMENT_MISSING exception through
   DispatchablePlanMetadata (same mechanism as the non-partitioned leaf stage,
   honoring the ignoreMissingSegments query option).

3. Add missing pinot-java-client test dependency to pinot-broker. Since apache#17167
   made ControllerTest reference PinotAdminException (test scoped in
   pinot-controller, thus not transitive), every ControllerTest derived test
   class in pinot-broker failed to load and silently ran 0 tests
   (SegmentPartitionMetadataManagerTest, TimeBoundaryManagerTest,
   SegmentPrunerTest, SegmentZkMetadataFetcherTest,
   BrokerRoutingManagerConcurrencyTest, RemoteClusterBrokerRoutingManagerTest).
   With the dependency restored, all of them run and pass again.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@xiangfu0 xiangfu0 force-pushed the fix-partition-metadata-replication-tracking branch from 66aa4b7 to d51e13f Compare June 12, 2026 02:45
@codecov-commenter

codecov-commenter commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.77778% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.49%. Comparing base (4ff9dbe) to head (d51e13f).
⚠️ Report is 23 commits behind head on master.

Files with missing lines Patch % Lines
...mentpartition/SegmentPartitionMetadataManager.java 85.45% 5 Missing and 3 partials ⚠️
.../org/apache/pinot/query/routing/WorkerManager.java 42.85% 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18744      +/-   ##
============================================
+ Coverage     64.51%   65.49%   +0.98%     
- Complexity     1291     1309      +18     
============================================
  Files          3372     3381       +9     
  Lines        208638   209857    +1219     
  Branches      32596    32816     +220     
============================================
+ Hits         134604   137449    +2845     
+ Misses        63239    61356    -1883     
- Partials      10795    11052     +257     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 65.49% <77.77%> (+0.98%) ⬆️
temurin 65.49% <77.77%> (+0.98%) ⬆️
unittests 65.49% <77.77%> (+0.98%) ⬆️
unittests1 56.98% <52.94%> (+0.06%) ⬆️
unittests2 38.09% <69.44%> (+0.95%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens partitioned-table routing for multi-stage engine (MSE) queries by making SegmentPartitionMetadataManager aware of per-segment replication progress (vs the current IdealState assignment) and by surfacing segments with no available replicas to the query response (instead of turning them into full partition routing failures). It also fixes pinot-broker test execution by restoring a missing test-scoped dependency.

Changes:

  • Track per-segment replication state against IdealState in SegmentPartitionMetadataManager to avoid “cliff” failures when segments aren’t yet fully replicated or are replaced/moved.
  • Propagate “unavailable segments” (no available replica) into DispatchablePlanMetadata so the broker can report SERVER_SEGMENT_MISSING consistently for partitioned routing.
  • Restore pinot-java-client as a test dependency for pinot-broker and update impacted tests/utilities for the new TablePartitionReplicatedServersInfo shape.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pinot-query-planner/src/test/java/org/apache/pinot/query/QueryEnvironmentTestBase.java Updates test scaffolding for the expanded TablePartitionReplicatedServersInfo constructor.
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java Attaches unavailable segments into dispatch metadata so they can be surfaced in query responses.
pinot-core/src/main/java/org/apache/pinot/core/routing/TablePartitionReplicatedServersInfo.java Adds an unavailableSegments field + getter to carry “no available replica” information.
pinot-broker/src/test/java/org/apache/pinot/broker/routing/segmentpartition/SegmentPartitionMetadataManagerTest.java Adds new tests for replication-state tracking and pending-expiration behavior.
pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpartition/SegmentPartitionMetadataManager.java Implements replication-state tracking keyed to IdealState changes; excludes pending segments; records unavailable segments.
pinot-broker/pom.xml Restores pinot-java-client test dependency so broker tests execute instead of silently running 0 tests.

* 1. For each partition ID, what are the servers that contains ALL segments belong to this partition ID.
* 2. For each server, what are all the partition IDs and list of segments of those partition IDs on this server.
*/
/// The `PartitionDataManager` manages partitions of a table. It manages
Comment on lines +60 to +61
/// forever. Segments without any available replica are also excluded because they cannot be served regardless of the
/// server picked, and are reported via [TablePartitionReplicatedServersInfo#getUnavailableSegments()].
Comment on lines +907 to +908
// Attach unavailable segments (no available replica, excluded from the partition info) to the metadata so that
// they are reported in the query response
Comment on lines +74 to +75
/// Returns the segments excluded from the partition info because they have no available replica, thus cannot be
/// served regardless of the server picked.

// After the pending state expires, segmentY should be treated as a regular segment and reduce the fully
// replicated servers, so that its data is still served from the available replica
Thread.sleep(10);

@xiangfu0 xiangfu0 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal routing correctness issue; see inline comment.

if (isServingState(entry.getValue())) {
String server = entry.getKey();
numTargetServers++;
targetServersHash += server.hashCode();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This target-set fingerprint is not unique enough for assignment changes. For example, {Server_localhost_8000, Server_localhost_8003} and {Server_localhost_8001, Server_localhost_8002} have the same count and summed String.hashCode(), so a rebalance between them will not reset _fullyReplicatedOnce. That lets a moved segment reduce the fully replicated server set immediately and can reintroduce the partition routing failures this PR is fixing. Please track the actual target server set, or use a collision-resistant fingerprint, before treating the assignment as unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is not working as expected multi-stage Related to the multi-stage query engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants