Problem / Background
gemma-4-12b-it-4bit is model_type: gemma4_unified (ModelType::Gemma4Unified). Its tokenizer carries the <|channel> / <channel|> added tokens, so on server startup tokenizer.infer_thinking_markers() resolves a think-marker pair through the multi-token (Gemma 4) branch and has_thinking() returns true. As a result the server flips the chat_template Jinja kwarg enable_thinking to true by default.
Verified code path:
- src/tokenizer/mod.rs (infer_thinking_markers, around line 214). The multi-token branch (around line 242) returns a populated ThinkingMarkers when both <|channel> and <channel|> are present in the vocab, with think_start = "<|channel>thought" and think_end = "<channel|>".
- src/server/startup.rs (around lines 1591 to 1622). When thinking_markers.has_thinking() is true the server calls chat_template.set_default_enable_thinking(true). The in-tree comment cites ml-explore/mlx-lm#1114 and the <|channel>thought / <channel|> markers "for Gemma 4 and friends".
- Per-request chat_template_kwargs still win on conflict via merge_server_and_request (src/server/chat_template_kwargs.rs, around line 306), and OPEN_THINKING_SUFFIXES = ["<|channel>thought\n", "<think>\n"] (src/server/routes/chat.rs, around line 918) confirms the primed open-thinking block for both Gemma 4 and Qwen families.
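For readers who do not have the tree open, the inference step described above can be sketched roughly as follows. This is a simplified illustration, not the code in src/tokenizer/mod.rs: the struct layout, the HashSet-based vocab argument, and the single-token <think> branch are assumptions made for the example. Only the marker strings and the "both tokens present" condition come from the description above.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the real ThinkingMarkers type.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct ThinkingMarkers {
    pub think_start: Option<String>,
    pub think_end: Option<String>,
}

impl ThinkingMarkers {
    /// True only when both ends of the marker pair were resolved.
    pub fn has_thinking(&self) -> bool {
        self.think_start.is_some() && self.think_end.is_some()
    }
}

/// Sketch of marker inference over a vocabulary of added tokens.
/// The real function takes the tokenizer itself; a plain set is assumed here.
pub fn infer_thinking_markers(vocab: &HashSet<&str>) -> ThinkingMarkers {
    // Multi-token (Gemma 4) branch: both channel tokens must be present.
    if vocab.contains("<|channel>") && vocab.contains("<channel|>") {
        return ThinkingMarkers {
            think_start: Some("<|channel>thought".to_string()),
            think_end: Some("<channel|>".to_string()),
        };
    }
    // Single-token branch (Qwen-style); assumed shape, for illustration only.
    if vocab.contains("<think>") && vocab.contains("</think>") {
        return ThinkingMarkers {
            think_start: Some("<think>".to_string()),
            think_end: Some("</think>".to_string()),
        };
    }
    // No marker pair found: has_thinking() will be false.
    ThinkingMarkers::default()
}
```

Under this sketch, any tokenizer that ships both channel tokens resolves a marker pair, which is the condition that makes startup flip enable_thinking on by default.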
This is the intended behavior introduced to mirror upstream ml-explore/mlx-lm#1114 (TokenizerWrapper.apply_chat_template defaulting enable_thinking=self.has_thinking). It is not a regression. It was observed while validating #350 / #351 (and #333), and is orthogonal to #350 (the placeholder-leak fix, now merged).
Symptom
With enable_thinking=true (the startup default for this model), gemma-4-12b-it produces very verbose thinking, so the answer content is truncated or empty at normal token budgets. Observed against gemma-4-12b-it-4bit on /v1/chat/completions, temperature 0:
- Prompt "What is a hash table? Answer in one sentence." at max_tokens=80: empty content, finish_reason=length (the whole budget is consumed inside the thinking block).
- Same prompt at max_tokens=400: a coherent one-sentence answer, but finish_reason=stop only after roughly 275 completion tokens, meaning about 245 tokens were spent thinking for a one-sentence answer.
- Prompt "Explain in three sentences how a hash table achieves average O(1) lookup." at max_tokens=800: finish_reason=length, content truncated to roughly 33 characters (still not finished after 800 tokens).
- With chat_template_kwargs={"enable_thinking": false}, the same prompts answer concisely and correctly with finish_reason=stop in well under 80 tokens.
So the verbosity and truncation are driven specifically by the thinking default, not by the model being broken.
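The arithmetic behind these observations can be modeled with a small sketch. It assumes thinking tokens are always generated before answer tokens and that both draw from the same max_tokens budget. The function and struct names are hypothetical, and the figures of 245 thinking tokens and 30 answer tokens are the rough numbers from the one-sentence case above, not exact measurements.

```rust
/// Hypothetical result of a budgeted generation, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct Outcome {
    pub answer_tokens_emitted: u32,
    pub finish_reason: &'static str,
}

/// Simplified budget model: thinking tokens are spent first, and the answer
/// only gets whatever remains of max_tokens.
pub fn simulate_budget(max_tokens: u32, thinking_tokens: u32, answer_tokens: u32) -> Outcome {
    // Tokens left once the thinking block has been generated.
    let left_for_answer = max_tokens.saturating_sub(thinking_tokens);
    let emitted = left_for_answer.min(answer_tokens);
    // "stop" only if thinking plus the full answer fit inside the budget.
    let fits = thinking_tokens.saturating_add(answer_tokens) <= max_tokens;
    Outcome {
        answer_tokens_emitted: emitted,
        finish_reason: if fits { "stop" } else { "length" },
    }
}
```

With these assumed figures, simulate_budget(80, 245, 30) yields no answer tokens and finish_reason "length", simulate_budget(400, 245, 30) yields the full answer with "stop", and simulate_budget(80, 0, 30) (thinking disabled) also finishes with "stop", matching the pattern in the repro.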
Secondary observation (verified): reasoning_content is dropped on the non-streaming path
In the repro the non-streaming response carried no reasoning content even though about 245 thinking tokens were generated for the one-sentence case. This is confirmed at the code level, and it is not gemma-4-specific:
- The non-streaming response message type ChatMessage (src/server/types/response.rs, around line 113) has only role, content, and tool_calls. There is no reasoning_content field. The non-streaming handler non_stream_chat_completion (src/server/routes/chat.rs, around line 341) strips the thinking scratchpad via strip_unclosed_primed_thinking and clean_structural_tokens and never surfaces it. So for any thinking model (qwen3 with <think>/</think> included), the non-streaming /v1/chat/completions response discards the reasoning entirely.
- The streaming path does surface it: src/server/routes/chat.rs (around lines 729 and 793) emits delta.reasoning_content chunks via the StreamFilter for both the <think> (Qwen) and <|channel> (Gemma 4) families.
Net: the discarded thinking is a general gap in the non-streaming response shape, not a gemma-4-only extraction bug. It is worth tracking here because it compounds the user-visible impact for this model.
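As a rough illustration of what surfacing the reasoning on the non-streaming path could look like, the sketch below splits raw model output into a reasoning part and an answer part instead of discarding the former. It is not the existing strip_unclosed_primed_thinking logic. SplitOutput, split_reasoning, and the primed flag are hypothetical names introduced for this example, and the marker strings are passed in rather than hard-coded.

```rust
/// Hypothetical result type: reasoning kept separately instead of discarded.
#[derive(Debug, PartialEq, Eq)]
pub struct SplitOutput {
    pub reasoning_content: Option<String>,
    pub content: String,
}

/// Trim the text and return None when nothing is left.
fn non_empty(s: &str) -> Option<String> {
    let t = s.trim();
    if t.is_empty() { None } else { Some(t.to_string()) }
}

/// Split raw output into reasoning and answer.
/// `primed` is an assumption of this sketch: it means the prompt already ended
/// with the open marker, so generation starts inside the thinking block.
pub fn split_reasoning(raw: &str, think_start: &str, think_end: &str, primed: bool) -> SplitOutput {
    let body = match raw.trim_start().strip_prefix(think_start) {
        Some(rest) => Some(rest),
        None if primed => Some(raw),
        None => None,
    };
    match body {
        // No thinking block at all: everything is answer content.
        None => SplitOutput { reasoning_content: None, content: raw.trim().to_string() },
        Some(body) => match body.split_once(think_end) {
            // Closed block: text before the end marker is reasoning, the rest is the answer.
            Some((reasoning, answer)) => SplitOutput {
                reasoning_content: non_empty(reasoning),
                content: answer.trim().to_string(),
            },
            // Unclosed block (budget ran out while thinking): empty answer, reasoning kept.
            None => SplitOutput { reasoning_content: non_empty(body), content: String::new() },
        },
    }
}
```

In the unclosed case this returns an empty content, which is the same user-visible answer as today, but the caller would still have the reasoning text available to return rather than nothing.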
Impact
A default /v1/chat/completions request to gemma-4-12b-it with a typical max_tokens returns an empty or truncated answer, which looks like the model is broken when it is not. This affects the default serving experience for this model and is user-visible.
Directions to investigate (not prescriptive)
- Reconsider whether <|channel> detection should default enable_thinking=true for gemma4_unified, given the model over-thinks simple prompts, or whether the default should be off, opt-in, or at least documented.
- If thinking stays enabled, surface reasoning_content on the non-streaming response (the field is absent today) rather than silently stripping it, and document a recommended higher max_tokens for this model.
- Confirm whether the model actually emits the <|channel>thought ... <channel|> markers in its output (which bears on extraction), or whether the over-thinking is unmarked verbose generation that the marker-based stripper never matches.
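Acceptance Criteria
- A default /v1/chat/completions request to gemma-4-12b-it with a typical max_tokens (for example 256) returns a non-empty, coherent answer, achieved either by adjusting the thinking default for this model or by ensuring the answer fits within a normal budget.
- Thinking output is either surfaced as reasoning_content (non-streaming included) or intentionally documented as stripped.
- The recommended configuration (thinking default, max_tokens) is documented for gemma4_unified.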
Technical Considerations
- The thinking default originates in set_default_enable_thinking(true) driven by infer_thinking_markers().has_thinking(). Any change should preserve the existing precedence: explicit per-request chat_template_kwargs.enable_thinking and the CLI/env defaults (--chat-template-kwargs, LLAMA_ARG_CHAT_TEMPLATE_KWARGS) must continue to win via merge_server_and_request.
- Adding reasoning_content to the non-streaming ChatMessage would close the gap for every thinking family at once (qwen3 and gemma4), not just Gemma 4.
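The precedence that any change needs to preserve can be stated as a one-line rule. The sketch below is an assumption-laden simplification: resolve_enable_thinking is a hypothetical name, and the real merge_server_and_request operates on full kwarg maps rather than a single boolean. It only illustrates the ordering described above: per-request value first, then the CLI/env default, then the marker-inferred default.

```rust
/// Hypothetical single-key view of the kwarg precedence:
/// per-request value, then CLI/env default, then the inferred default.
pub fn resolve_enable_thinking(
    request: Option<bool>,
    cli_or_env: Option<bool>,
    inferred_default: bool,
) -> bool {
    // Option::or keeps the first Some; the inferred default is the last resort.
    request.or(cli_or_env).unwrap_or(inferred_default)
}
```

For example, resolve_enable_thinking(Some(false), None, true) is false, so a request that opts out keeps winning even if the inferred default for gemma4_unified stays on. Changing only the third argument for this model type would leave both override paths untouched.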
Provenance
Observed during validation of #350 / #351 and #333. Not introduced by those changes: the thinking default is intentional and mirrors ml-explore/mlx-lm#1114. Bare #350, #351, #333, and #348 are references to this repository.