fix: clamp max_new_tokens on retry to prevent response_length overflow#2003

Open
YaoweiFan wants to merge 1 commit into
THUDM:mainfrom
YaoweiFan:fix/fully-async-retry-max-new-tokens

@YaoweiFan YaoweiFan commented Jun 1, 2026

Summary

ABORTED samples are pushed back into the data buffer by fully_async_rollout's done callback and picked up again by the worker, but their sample.tokens / response_length still carry the partial output from the first attempt. Meanwhile, the per-call sampling_params is a shallow copy of GenerateState.sampling_params, whose max_new_tokens is fixed at args.rollout_max_response_len and never decremented. On retry, sglang sees a new context and can generate up to another full rollout_max_response_len tokens, so sample.response_length += new_tokens can climb to ~2x the configured cap (or higher across retry chains), breaking the training-side per-GPU token budget.
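The overflow can be illustrated with a toy model of the retry path. This is only a sketch under assumptions, not the project's actual code: `FakeSample` and `unclamped_generate` are hypothetical stand-ins, and the "engine" simply emits as many tokens as it is allowed to.

```python
from dataclasses import dataclass

ROLLOUT_MAX_RESPONSE_LEN = 8192  # stands in for args.rollout_max_response_len


@dataclass
class FakeSample:
    """Hypothetical stand-in for a rollout sample; only tracks response_length."""
    response_length: int = 0


def unclamped_generate(sample: FakeSample, tokens_wanted: int) -> None:
    """Model one attempt where max_new_tokens is always the full cap."""
    max_new_tokens = ROLLOUT_MAX_RESPONSE_LEN  # fixed, never decremented
    new_tokens = min(tokens_wanted, max_new_tokens)
    sample.response_length += new_tokens  # partial output carries over


sample = FakeSample()
unclamped_generate(sample, tokens_wanted=6000)   # first attempt, then ABORTED
unclamped_generate(sample, tokens_wanted=10000)  # retry runs to the full cap
print(sample.response_length)  # 14192, well above the 8192 cap
```

Each further retry in a chain can add up to another full cap, which is why the total is not bounded at 2x.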

This PR clamps max_new_tokens to (rollout_max_response_len - response_length) at the entry of generate(), and marks the sample TRUNCATED when the budget is already exhausted. Mirrors the retool multi-turn fix in #1861.
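A minimal sketch of the clamping logic described above, assuming sampling_params is a plain dict. `FakeSample`, `Status`, and `clamp_sampling_params` are hypothetical names used for illustration; the actual change lives at the entry of generate() and may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    """Hypothetical subset of sample statuses."""
    PENDING = "pending"
    ABORTED = "aborted"
    TRUNCATED = "truncated"


@dataclass
class FakeSample:
    """Hypothetical stand-in for a rollout sample."""
    response_length: int = 0
    status: Status = Status.PENDING


def clamp_sampling_params(
    sample: FakeSample, sampling_params: dict, rollout_max_response_len: int
) -> Optional[dict]:
    """Return a per-call copy with max_new_tokens clamped to the remaining budget.

    Returns None and marks the sample TRUNCATED when no budget is left.
    """
    remaining = rollout_max_response_len - sample.response_length
    if remaining <= 0:
        sample.status = Status.TRUNCATED
        return None
    params = dict(sampling_params)  # shallow copy; the shared dict is untouched
    params["max_new_tokens"] = min(params.get("max_new_tokens", remaining), remaining)
    return params


shared = {"max_new_tokens": 8192, "temperature": 1.0}

retry = FakeSample(response_length=6000, status=Status.ABORTED)
params = clamp_sampling_params(retry, shared, 8192)
print(params["max_new_tokens"])  # 2192

exhausted = FakeSample(response_length=8192, status=Status.ABORTED)
result = clamp_sampling_params(exhausted, shared, 8192)
print(result, exhausted.status)  # None Status.TRUNCATED
```

Clamping on a copy keeps the shared GenerateState.sampling_params at the full cap for fresh samples, while retried samples only receive what is left of their budget.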

Symptoms previously surfaced as the loss_mask/response_length mismatch in THUDM#1440 (response_len/max=16384 with rollout_max_response_len=8192) and the partial-resume TODO noted in examples/fully_async/README.md after THUDM#1920.