[client] Fix stale metadata on readOnlyGateway by adding RetryableGatewayClientProxy by loserwang1024 · Pull Request #3390 · apache/fluss

loserwang1024 · 2026-05-27T12:55:27Z

Purpose

Linked issue: close #3389

Brief change log

Tests

API and Format

Documentation


loserwang1024 · 2026-05-27T12:56:27Z

@swuferhong @wuchong @fresh-borzoni , CC

fresh-borzoni

@loserwang1024 Thank you for the very important PR, left some comments, PTAL

fresh-borzoni · 2026-05-29T01:18:33Z

    private final AdminReadOnlyGateway readOnlyGateway;
    private final MetadataUpdater metadataUpdater;

+    private static final int READ_ONLY_GATEWAY_MAX_RETRIES = 3;


With maxRetries=3, bootstrap reinit needs 4 refreshes. You only get 3 per request.
Shall we loop inside updateMetadata until either success or null-triggered bootstrap?

fresh-borzoni · 2026-05-29T01:23:53Z

+                            cause);
+                    // Run metadata refresh and retry on a separate thread to avoid
+                    // blocking Netty IO threads that may complete the failed future.
+                    CompletableFuture.runAsync(


Do we want some backoff?
As written, the 3 retries fire within milliseconds, which seems wasteful when DNS is slow or pods are restarting.
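To make the backoff suggestion concrete, here is a minimal sketch of retrying an async call with capped exponential backoff. `BackoffRetry`, its method names, and the delay parameters are hypothetical and are not taken from the PR or from Fluss; the sketch only assumes the call returns a `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper (not a Fluss class): retries an async call with capped exponential backoff. */
final class BackoffRetry {
    private BackoffRetry() {}

    /** Delay before retry number {@code attempt} (0-based): base, 2*base, 4*base, ... capped at max. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay = Math.min(maxMillis, delay * 2);
        }
        return delay;
    }

    /** Invokes {@code call}; on failure, schedules up to {@code maxRetries} retries on the scheduler. */
    static <T> CompletableFuture<T> retry(
            Supplier<CompletableFuture<T>> call,
            int maxRetries,
            long baseMillis,
            long maxMillis,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        runAttempt(call, 0, maxRetries, baseMillis, maxMillis, scheduler, result);
        return result;
    }

    private static <T> void runAttempt(
            Supplier<CompletableFuture<T>> call,
            int attempt,
            int maxRetries,
            long baseMillis,
            long maxMillis,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletableFuture<T> future;
        try {
            future = call.get();
        } catch (Throwable t) {
            // Treat a synchronous throw like a failed future.
            future = new CompletableFuture<>();
            future.completeExceptionally(t);
        }
        future.whenComplete(
                (value, error) -> {
                    if (error == null) {
                        result.complete(value);
                    } else if (attempt >= maxRetries) {
                        result.completeExceptionally(error);
                    } else {
                        // Wait before the next attempt instead of retrying immediately.
                        scheduler.schedule(
                                () -> runAttempt(call, attempt + 1, maxRetries,
                                        baseMillis, maxMillis, scheduler, result),
                                delayMillis(attempt, baseMillis, maxMillis),
                                TimeUnit.MILLISECONDS);
                    }
                });
    }
}
```

Scheduling the retry on a `ScheduledExecutorService` keeps the waiting off the thread that completed the failed future, so no thread sleeps during the delay.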

fresh-borzoni · 2026-05-29T01:25:15Z

+ *   <li>Metadata refresh is triggered, which marks the failed server as unavailable
+ *   <li>After N failed refreshes, all servers are marked unavailable, triggering re-initialization
+ *       from bootstrap servers
+ *   <li>The next retry succeeds with the refreshed server addresses


ditto: only true when maxRetries > cluster_size

fresh-borzoni · 2026-05-29T01:26:25Z

                GatewayClientProxy.createGatewayProxy(
                        metadataUpdater::getCoordinatorServer, client, AdminGateway.class);
-        this.readOnlyGateway =
+        AdminGateway rawReadOnlyGateway =


Shall we add a TODO for writes, since they are still broken?

fresh-borzoni · 2026-05-29T01:27:35Z

    private final AdminReadOnlyGateway readOnlyGateway;
    private final MetadataUpdater metadataUpdater;

+    private static final int READ_ONLY_GATEWAY_MAX_RETRIES = 3;


Shall we make it a ConfigOption to make it more operator-friendly?

fresh-borzoni · 2026-05-29T01:29:10Z

-        this.readOnlyGateway =
+        AdminGateway rawReadOnlyGateway =
                GatewayClientProxy.createGatewayProxy(
                        metadataUpdater::getRandomTabletServer, client, AdminGateway.class);


AdminReadOnlyGateway.class?

fresh-borzoni · 2026-05-29T01:59:55Z

+                            cause);
+                    // Run metadata refresh and retry on a separate thread to avoid
+                    // blocking Netty IO threads that may complete the failed future.
+                    CompletableFuture.runAsync(


runAsync without an executor uses ForkJoinPool.commonPool(). Should we use a dedicated executor instead?
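A minimal sketch of the dedicated-executor alternative, assuming a single daemon thread is enough for retry work. `RetryExecutorExample` and the thread-name prefix are hypothetical; only the JDK calls (`Executors.newSingleThreadExecutor(ThreadFactory)` and the two-argument `CompletableFuture.supplyAsync`) are standard API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical example (not a Fluss class): run async work on a named, dedicated executor. */
final class RetryExecutorExample {
    private RetryExecutorExample() {}

    /** Creates a single-threaded executor whose daemon threads carry a recognizable name. */
    static ExecutorService newRetryExecutor(String namePrefix) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory =
                runnable -> {
                    Thread t = new Thread(runnable, namePrefix + "-" + counter.incrementAndGet());
                    t.setDaemon(true); // do not keep the JVM alive for retry work
                    return t;
                };
        return Executors.newSingleThreadExecutor(factory);
    }

    /** Passing the executor explicitly keeps the task off ForkJoinPool.commonPool(). */
    static CompletableFuture<String> runOn(ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> Thread.currentThread().getName(), executor);
    }
}
```

With a named thread, blocked retries show up clearly in thread dumps, and a slow metadata refresh cannot starve unrelated tasks that share the common pool. The owner of the executor would still need to shut it down when the client closes.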

fresh-borzoni · 2026-05-29T02:05:03Z

+                            cause);
+                    // Run metadata refresh and retry on a separate thread to avoid
+                    // blocking Netty IO threads that may complete the failed future.
+                    CompletableFuture.runAsync(


Every retry here fires its own updateMetadata call, and that method's synchronized(this) block is the same one the data paths use to refresh leader info.
Example: during a rolling upgrade, N concurrent failing admin calls × 3 retries all queue up behind one lock, and the data plane's refreshes wait in the same line.

Could we share one in-flight refresh across concurrent retriers?
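One way to share a single in-flight refresh is a "single-flight" wrapper: the first caller starts the refresh and every concurrent caller receives the same future. The sketch below is an assumption about how that could look, not code from the PR; `SingleFlightRefresher` and `refreshShared` are hypothetical names, and the refresh itself is modeled as a plain `Runnable`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper (not a Fluss class): concurrent callers share one in-flight refresh. */
final class SingleFlightRefresher {
    private final Runnable refresh;
    private final Executor executor;
    private final AtomicReference<CompletableFuture<Void>> inFlight = new AtomicReference<>();

    SingleFlightRefresher(Runnable refresh, Executor executor) {
        this.refresh = refresh;
        this.executor = executor;
    }

    /** Returns the running refresh if there is one; otherwise starts a new one. */
    CompletableFuture<Void> refreshShared() {
        while (true) {
            CompletableFuture<Void> current = inFlight.get();
            if (current != null) {
                return current; // join the refresh that is already running
            }
            CompletableFuture<Void> mine = new CompletableFuture<>();
            if (!inFlight.compareAndSet(null, mine)) {
                continue; // another caller won the race; loop and join theirs
            }
            try {
                executor.execute(
                        () -> {
                            try {
                                refresh.run();
                                // Clear the slot before completing so later callers start fresh.
                                inFlight.set(null);
                                mine.complete(null);
                            } catch (Throwable t) {
                                inFlight.set(null);
                                mine.completeExceptionally(t);
                            }
                        });
            } catch (RuntimeException rejected) {
                // The executor refused the task; do not leave a stuck future behind.
                inFlight.set(null);
                mine.completeExceptionally(rejected);
            }
            return mine;
        }
    }
}
```

With this shape, N concurrent failing admin calls would trigger one refresh rather than N, so far fewer callers would contend for the lock.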

loserwang1024 · 2026-06-01T11:12:15Z

@fresh-borzoni, I've revised the design: instead of retrying 3 times, it now rebuilds metadata via refreshClusterUntilAvailable, which loops until either some IP becomes available or it falls back to bootstrap. There is no backoff for now: I'd rather keep refreshClusterUntilAvailable purely "loop until available or bootstrap" to avoid over-engineering. The two existing layers (connection timeout + bootstrap exponential backoff) already provide sufficient throttling.
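The "loop until available or bootstrap" idea can be sketched as follows. This is a simplified model under stated assumptions, not the actual refreshClusterUntilAvailable implementation, which is not shown in this thread: servers are modeled as strings, reachability as a `Predicate`, and bootstrap re-initialization as a `Runnable`. All names in the block are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model (not Fluss code) of "loop until a server is available or fall back to bootstrap". */
final class RefreshLoopSketch {
    enum Outcome { SERVER_AVAILABLE, FELL_BACK_TO_BOOTSTRAP }

    private RefreshLoopSketch() {}

    static Outcome refreshUntilAvailable(
            List<String> knownServers,
            Predicate<String> isReachable,
            Runnable reinitFromBootstrap) {
        // Work on a copy so the caller's view of the cluster is not mutated.
        List<String> candidates = new ArrayList<>(knownServers);
        while (!candidates.isEmpty()) {
            String server = candidates.get(0);
            if (isReachable.test(server)) {
                return Outcome.SERVER_AVAILABLE;
            }
            // Mark the failed server as unavailable and try the next one.
            candidates.remove(0);
        }
        // Every known server failed: re-initialize from the bootstrap servers.
        reinitFromBootstrap.run();
        return Outcome.FELL_BACK_TO_BOOTSTRAP;
    }
}
```

Because the loop is bounded by the number of known servers rather than by a fixed retry count, it reaches the bootstrap fallback regardless of cluster size, which is the gap the earlier maxRetries comments point at.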

[client] Fix stale metadata on readOnlyGateway by adding RetryableGat…

662772d

…ewayClientProxy

fresh-borzoni reviewed May 29, 2026

modified based on advice.

afb35e8

loserwang1024 wants to merge 2 commits into apache:main from loserwang1024:retry-with-retry
