Skip to content

fix(server): gemma-4-12b-it defaults to enable_thinking, yielding verbose thinking and truncated/empty answers at normal max_tokens #352

@inureyes

Description

@inureyes

Problem / Background

gemma-4-12b-it-4bit is model_type: gemma4_unified (ModelType::Gemma4Unified). Its tokenizer carries the <|channel> / <channel|> added tokens, so on server startup tokenizer.infer_thinking_markers() resolves a think-marker pair through the multi-token (Gemma 4) branch and has_thinking() returns true. As a result the server flips the chat_template Jinja kwarg enable_thinking to true by default.

Verified code path:

  • src/tokenizer/mod.rs (infer_thinking_markers, around line 214). The multi-token branch (around line 242) returns a populated ThinkingMarkers when both <|channel> and <channel|> are present in the vocab, with think_start = "<|channel>thought" and think_end = "<channel|>".
  • src/server/startup.rs (around lines 1591 to 1622). When thinking_markers.has_thinking() is true the server calls chat_template.set_default_enable_thinking(true). The in-tree comment cites ml-explore/mlx-lm#1114 and the <|channel>thought / <channel|> markers "for Gemma 4 and friends".
  • Per-request chat_template_kwargs still win on conflict via merge_server_and_request (src/server/chat_template_kwargs.rs, around line 306), and OPEN_THINKING_SUFFIXES = ["<|channel>thought\n", "<think>\n"] (src/server/routes/chat.rs, around line 918) confirms the primed open-thinking block for both Gemma 4 and Qwen families.

This is the intended behavior introduced to mirror upstream ml-explore/mlx-lm#1114 (TokenizerWrapper.apply_chat_template defaulting enable_thinking=self.has_thinking). It is not a regression. It was observed while validating #350 / #351 (and #333), and is orthogonal to #350 (the placeholder-leak fix, now merged).

Symptom

With enable_thinking=true (the startup default for this model), gemma-4-12b-it produces very verbose thinking, so the answer content is truncated or empty at normal token budgets. Observed against gemma-4-12b-it-4bit on /v1/chat/completions, temperature 0:

  • Prompt "What is a hash table? Answer in one sentence." at max_tokens=80: empty content, finish_reason=length (the whole budget is consumed inside the thinking block).
  • Same prompt at max_tokens=400: a coherent one-sentence answer, but finish_reason=stop only after roughly 275 completion tokens, meaning about 245 tokens were spent thinking for a one-sentence answer.
  • Prompt "Explain in three sentences how a hash table achieves average O(1) lookup." at max_tokens=800: finish_reason=length, content truncated to roughly 33 characters (still not finished after 800 tokens).
  • With chat_template_kwargs={"enable_thinking": false}, the same prompts answer concisely and correctly with finish_reason=stop in well under 80 tokens.

So the verbosity and truncation are driven specifically by the thinking default, not by the model being broken.

Secondary observation (verified): reasoning_content is dropped on the non-streaming path

In the repro the non-streaming response carried no reasoning content even though about 245 thinking tokens were generated for the one-sentence case. This is confirmed at the code level, and it is not gemma-4-specific:

  • The non-streaming response message type ChatMessage (src/server/types/response.rs, around line 113) has only role, content, and tool_calls. There is no reasoning_content field. The non-streaming handler non_stream_chat_completion (src/server/routes/chat.rs, around line 341) strips the thinking scratchpad via strip_unclosed_primed_thinking and clean_structural_tokens and never surfaces it. So for any thinking model (qwen3 with <think>/</think> included), the non-streaming /v1/chat/completions response discards the reasoning entirely.
  • The streaming path does surface it: src/server/routes/chat.rs (around lines 729 and 793) emits delta.reasoning_content chunks via the StreamFilter for both the <think> (Qwen) and <|channel> (Gemma 4) families.

Net: the discarded thinking is a general gap in the non-streaming response shape, not a gemma-4-only extraction bug. It is worth tracking here because it compounds the user-visible impact for this model.

Impact

A default /v1/chat/completions request to gemma-4-12b-it with a typical max_tokens returns an empty or truncated answer, which looks like the model is broken when it is not. This affects the default serving experience for this model and is user-visible.

Directions to investigate (not prescriptive)

  1. Reconsider whether <|channel> detection should default enable_thinking=true for gemma4_unified, given the model over-thinks simple prompts, or whether the default should be off, opt-in, or at least documented.
  2. If thinking stays enabled, surface reasoning_content on the non-streaming response (the field is absent today) rather than silently stripping it, and document a recommended higher max_tokens for this model.
  3. Confirm whether the model actually emits the <|channel>thought ... <channel|> markers in its output (which bears on extraction), or whether the over-thinking is unmarked verbose generation that the marker-based stripper never matches.

Acceptance Criteria

  • A default /v1/chat/completions request to gemma-4-12b-it with a typical max_tokens (for example 256) returns a non-empty, coherent answer, achieved either by adjusting the thinking default for this model or by ensuring the answer fits within a normal budget.
  • The thinking content is either surfaced via reasoning_content (non-streaming included) or intentionally documented as stripped.
  • The chosen behavior (thinking default and recommended max_tokens) is documented for gemma4_unified.

Technical Considerations

  • The thinking default originates in set_default_enable_thinking(true) driven by infer_thinking_markers().has_thinking(). Any change should preserve the existing precedence: explicit per-request chat_template_kwargs.enable_thinking and the CLI/env defaults (--chat-template-kwargs, LLAMA_ARG_CHAT_TEMPLATE_KWARGS) must continue to win via merge_server_and_request.
  • Adding reasoning_content to the non-streaming ChatMessage would close the gap for every thinking family at once (qwen3 and gemma4), not just Gemma 4.

Provenance

Observed during validation of #350 / #351 and #333. Not introduced by those changes: the thinking default is intentional and mirrors ml-explore/mlx-lm#1114. Bare #350, #351, #333, and #348 are references to this repository.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:coremlxcel-core: MLX FFI, primitives, KV cache, layersarea:inferenceGeneration, sampling, decoding (incl. speculative, DRY)priority:mediumMedium prioritystatus:doneCompletedtype:bugBug fixes, error corrections, or issue resolutions

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions