fix: clamp max_new_tokens on retry to prevent response_length overflow by YaoweiFan · Pull Request #2003 · THUDM/slime

YaoweiFan · 2026-06-01T12:12:35Z

Summary

ABORTED samples are pushed back into the data buffer by fully_async_rollout's done callback and re-picked by the worker, but their sample.tokens / response_length carry over the partial output from the first attempt. Meanwhile the per-call sampling_params is a shallow copy of GenerateState.sampling_params, whose max_new_tokens is fixed at args.rollout_max_response_len and never decremented. On retry sglang sees a new context and generates up to another full rollout_max_response_len tokens, so sample.response_length += new_tokens can climb to ~2x the configured cap (or higher across retry chains), breaking the training-side per-GPU token budget.
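The overflow described above can be checked with a little arithmetic. The sketch below is not slime code: `worst_case_response_length` and its arguments are hypothetical names used only to illustrate how the accumulated length grows when every attempt is allowed a full `rollout_max_response_len` budget.

```python
def worst_case_response_length(cap, aborted_partials):
    """Upper bound on response_length when max_new_tokens is never decremented.

    cap: the configured rollout_max_response_len.
    aborted_partials: tokens produced by each aborted attempt (hypothetical input).
    The final, non-aborted attempt is assumed to run to the full cap.
    """
    total = 0
    for produced in aborted_partials:
        # Each aborted attempt can contribute at most `cap` tokens on its own.
        total += min(produced, cap)
    # The last attempt still sees max_new_tokens == cap, regardless of `total`.
    total += cap
    return total


cap = 8192
one_retry = worst_case_response_length(cap, [8000])
retry_chain = worst_case_response_length(cap, [4000, 6000])
```

With one retry the bound approaches 2x the cap; with a chain of retries it can exceed 2x, which matches the "or higher across retry chains" note.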

This PR clamps max_new_tokens to (rollout_max_response_len - response_length) at the entry of generate(), and marks the sample TRUNCATED when the budget is already exhausted. Mirrors the retool multi-turn fix in #1861.
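A minimal sketch of the clamp, assuming a simplified `Sample` with only the two fields that matter here. The `Status` enum, the `Sample` dataclass, and the helper `clamp_sampling_params` are illustrative stand-ins, not the actual slime types or the code in this PR; taking the `min` with any existing `max_new_tokens` is also an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical stand-in for the sample status values
    PENDING = "pending"
    TRUNCATED = "truncated"


@dataclass
class Sample:  # hypothetical, reduced to the fields used below
    response_length: int = 0
    status: Status = Status.PENDING


def clamp_sampling_params(sample, sampling_params, rollout_max_response_len):
    """Return per-call params with max_new_tokens limited to the remaining budget.

    Returns None and marks the sample TRUNCATED when no budget is left.
    """
    remaining = rollout_max_response_len - sample.response_length
    if remaining <= 0:
        sample.status = Status.TRUNCATED
        return None
    # Copy so the shared sampling_params dict is never mutated.
    params = dict(sampling_params)
    params["max_new_tokens"] = min(params.get("max_new_tokens", remaining), remaining)
    return params


shared = {"max_new_tokens": 8192, "temperature": 1.0}
retried = Sample(response_length=5000)
clamped = clamp_sampling_params(retried, shared, 8192)

exhausted = Sample(response_length=8192)
nothing_left = clamp_sampling_params(exhausted, shared, 8192)
```

Copying the dict matters because the shared params are reused across calls; decrementing them in place would shrink the budget for unrelated samples.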

Symptoms previously surfaced as the loss_mask/response_length mismatch in THUDM#1440 (response_len/max=16384 with rollout_max_response_len=8192) and as the partial-resume TODO noted in examples/fully_async/README.md after THUDM#1920.


YaoweiFan wants to merge 1 commit into THUDM:main from YaoweiFan:fix/fully-async-retry-max-new-tokens
