Skip to content

Add retry for extension register#733

Closed
shreyamalpani wants to merge 3 commits into
aws:mainfrom
shreyamalpani:retry-extension-register
Closed

Add retry for extension register#733
shreyamalpani wants to merge 3 commits into
aws:mainfrom
shreyamalpani:retry-extension-register

Conversation

@shreyamalpani
Copy link
Copy Markdown

@shreyamalpani shreyamalpani commented May 27, 2026

Issue #, if available:
Datadog Lambda Extension crashing with v1.0.0 (Github Issue)

Description of changes:
LWA 1.0.0 changed to read AWS_LWA_LAMBDA_RUNTIME_API_PROXY on startup

// Apply runtime proxy configuration BEFORE starting tokio runtime
// This must happen before any threads are spawned to avoid unsafe env::set_var
Adapter::apply_runtime_proxy_config();

When LWA immediately POSTs /extension/register to that proxy address, it exits the process immediately since the the Datadog extension port from AWS_LWA_LAMBDA_RUNTIME_API_PROXY isn’t available yet, as the extension is not ready early enough.

This PR adds retry to the extension registration POST so LWA can connect once the extension's proxy port is ready. Uses exponential backoff with jitter, default base 20ms, configurable via AWS_LWA_EXTENSION_REGISTER_BASE_MS. Added tests covering retry on extension registration failure.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shreyamalpani shreyamalpani marked this pull request as ready for review May 27, 2026 16:38
@bnusunny
Copy link
Copy Markdown
Contributor

bnusunny commented May 27, 2026

Thanks for the fix — this is the right direction for the Datadog extension issue. A few comments before merge.

Requested before merge

1. Document the new env var

AWS_LWA_EXTENSION_REGISTER_BASE_MS isn't listed in:

  • README.md (the env table around line 68)
  • docs/guide/src/configuration/environment-variables.md

Commit 2912565 set the expectation that env vars get documented in lockstep. Please add a row to both with the default (20 ms) and a one-line description of when to tune it.

2. Guard against base_ms <= 1

let base_ms: u64 = env::var(ENV_EXTENSION_REGISTER_BASE_MS)
    .ok()
    .and_then(|s| s.parse().ok())
    .unwrap_or(EXTENSION_REGISTER_BASE_MS);
let strategy = ExponentialBackoff::from_millis(2)
    .factor(base_ms / 2)
    ...

If a user sets AWS_LWA_EXTENSION_REGISTER_BASE_MS=0 or =1, base_ms / 2 == 0 collapses every delay to 0 ms and we hammer the proxy as fast as the loop turns. Suggest clamping:

let base_ms: u64 = env::var(ENV_EXTENSION_REGISTER_BASE_MS)
    .ok()
    .and_then(|s| s.parse().ok())
    .unwrap_or(EXTENSION_REGISTER_BASE_MS)
    .max(2);

While you're there, a one-line comment explaining the from_millis(2).factor(base_ms/2) trick would help future readers — at first glance from_millis(20) looks like the natural form, but that produces 20, 400, 8000 ms because from_millis(b) yields b, b², b³, …. The current form deliberately produces base_ms, 2·base_ms, 4·base_ms, ….

Nice-to-have, not blocking

3. Closure capture style (taste)

let client_ref = &client;
let api_ref: &str = &aws_lambda_runtime_api;
let register_res = Retry::spawn(strategy, || async move { ... }).await?;

The _ref aliases in front of async move read a bit unusual. A plain || async { ... } capturing &client / &aws_lambda_runtime_api directly works without the intermediate bindings. Pure stylistic.

4. Tighten the retry-count assertion

assert!(hits >= 3, ...);

The mock returns 500 for the first two hits and 200 for the third, after which retry stops, so this should be exactly 3. assert_eq!(hits, 3, ...) would catch a regression where retry kept firing past success.

Comment thread src/lib.rs
tracing::warn!(error = %e, "extension registration attempt failed, will retry");
Err(Error::from(e))
}
Ok(res) if res.status() != StatusCode::OK => {
Copy link
Copy Markdown

@seshubaws seshubaws May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems now all non 200 statuses will be retried 5 times. However, what about the case of a deterministic 4xx (e.g., a malformed request shape from a future bug) which will also retry five times before exiting. We should consider short-circuiting on 4xx (except 408/429) to fail fast on non-transient errors:

Ok(res) if res.status().is_client_error() && res.status() != StatusCode::TOO_MANY_REQUESTS => Ok(res),

Comment thread src/lib.rs
// Extensions proxying AWS_LAMBDA_RUNTIME_API may not have bound their socket when LWA registers, retry the POST with exponential backoff to allow them to bind.
let base_ms: u64 = env::var(ENV_EXTENSION_REGISTER_BASE_MS)
.ok()
.and_then(|s| s.parse().ok())
Copy link
Copy Markdown

@seshubaws seshubaws May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this would swallow typos too, can we add a warn on parse failures? could be something like

let base_ms: u64 = match env::var(ENV_EXTENSION_REGISTER_BASE_MS) {
    Ok(raw) => match raw.parse::<u64>() {
        Ok(n) => n,
        Err(e) => {
            tracing::warn!(
                value = %raw,
                error = %e,
                "invalid {ENV_EXTENSION_REGISTER_BASE_MS}, using default {EXTENSION_REGISTER_BASE_MS}ms"
            );
            EXTENSION_REGISTER_BASE_MS
        }
    },
    Err(_) => EXTENSION_REGISTER_BASE_MS, // unset is fine, no warning
};

@bnusunny
Copy link
Copy Markdown
Contributor

This issue is addressed in #737. Closing.

@bnusunny bnusunny closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants