fix(server): gemma-4-12b-it defaults to enable_thinking, yielding verbose thinking and truncated/empty answers at normal max_tokens

## Problem / Background

`gemma-4-12b-it-4bit` is `model_type: gemma4_unified` (`ModelType::Gemma4Unified`). Its tokenizer carries the `<|channel>` / `<channel|>` added tokens, so on server startup `tokenizer.infer_thinking_markers()` resolves a think-marker pair through the multi-token (Gemma 4) branch and `has_thinking()` returns true. As a result the server flips the `chat_template` Jinja kwarg `enable_thinking` to `true` by default.

Verified code path:

- `src/tokenizer/mod.rs` (`infer_thinking_markers`, around line 214). The multi-token branch (around line 242) returns a populated `ThinkingMarkers` when both `<|channel>` and `<channel|>` are present in the vocab, with `think_start = "<|channel>thought"` and `think_end = "<channel|>"`.
- `src/server/startup.rs` (around lines 1591 to 1622). When `thinking_markers.has_thinking()` is true the server calls `chat_template.set_default_enable_thinking(true)`. The in-tree comment cites `ml-explore/mlx-lm#1114` and the `<|channel>thought` / `<channel|>` markers "for Gemma 4 and friends".
- Per-request `chat_template_kwargs` still win on conflict via `merge_server_and_request` (`src/server/chat_template_kwargs.rs`, around line 306), and `OPEN_THINKING_SUFFIXES = ["<|channel>thought\n", "<think>\n"]` (`src/server/routes/chat.rs`, around line 918) confirms the primed open-thinking block for both Gemma 4 and Qwen families.

This is the intended behavior introduced to mirror upstream `ml-explore/mlx-lm#1114` (`TokenizerWrapper.apply_chat_template` defaulting `enable_thinking=self.has_thinking`). It is not a regression. It was observed while validating #350 / #351 (and #333), and is orthogonal to #350 (the placeholder-leak fix, now merged).

## Symptom

With `enable_thinking=true` (the startup default for this model), gemma-4-12b-it produces very verbose thinking, so the answer `content` is truncated or empty at normal token budgets. Observed against `gemma-4-12b-it-4bit` on `/v1/chat/completions`, temperature 0:

- Prompt "What is a hash table? Answer in one sentence." at `max_tokens=80`: empty `content`, `finish_reason=length` (the whole budget is consumed inside the thinking block).
- Same prompt at `max_tokens=400`: a coherent one-sentence answer, but `finish_reason=stop` only after roughly 275 completion tokens, meaning about 245 tokens were spent thinking for a one-sentence answer.
- Prompt "Explain in three sentences how a hash table achieves average O(1) lookup." at `max_tokens=800`: `finish_reason=length`, `content` truncated to roughly 33 characters (still not finished after 800 tokens).
- With `chat_template_kwargs={"enable_thinking": false}`, the same prompts answer concisely and correctly with `finish_reason=stop` in well under 80 tokens.

So the verbosity and truncation are driven specifically by the thinking default, not by the model being broken.

## Secondary observation (verified): reasoning_content is dropped on the non-streaming path

In the repro the non-streaming response carried no reasoning content even though about 245 thinking tokens were generated for the one-sentence case. This is confirmed at the code level, and it is not gemma-4-specific:

- The non-streaming response message type `ChatMessage` (`src/server/types/response.rs`, around line 113) has only `role`, `content`, and `tool_calls`. There is no `reasoning_content` field. The non-streaming handler `non_stream_chat_completion` (`src/server/routes/chat.rs`, around line 341) strips the thinking scratchpad via `strip_unclosed_primed_thinking` and `clean_structural_tokens` and never surfaces it. So for any thinking model (qwen3 with `<think>`/`</think>` included), the non-streaming `/v1/chat/completions` response discards the reasoning entirely.
- The streaming path does surface it: `src/server/routes/chat.rs` (around lines 729 and 793) emits `delta.reasoning_content` chunks via the `StreamFilter` for both the `<think>` (Qwen) and `<|channel>` (Gemma 4) families.

Net: the discarded thinking is a general gap in the non-streaming response shape, not a gemma-4-only extraction bug. It is worth tracking here because it compounds the user-visible impact for this model.

## Impact

A default `/v1/chat/completions` request to gemma-4-12b-it with a typical `max_tokens` returns an empty or truncated answer, which looks like the model is broken when it is not. This affects the default serving experience for this model and is user-visible.

## Directions to investigate (not prescriptive)

1. Reconsider whether `<|channel>` detection should default `enable_thinking=true` for `gemma4_unified`, given the model over-thinks simple prompts, or whether the default should be off, opt-in, or at least documented.
2. If thinking stays enabled, surface `reasoning_content` on the non-streaming response (the field is absent today) rather than silently stripping it, and document a recommended higher `max_tokens` for this model.
3. Confirm whether the model actually emits the `<|channel>thought` ... `<channel|>` markers in its output (which bears on extraction), or whether the over-thinking is unmarked verbose generation that the marker-based stripper never matches.

## Acceptance Criteria

- [x] A default `/v1/chat/completions` request to gemma-4-12b-it with a typical `max_tokens` (for example 256) returns a non-empty, coherent answer, achieved either by adjusting the thinking default for this model or by ensuring the answer fits within a normal budget.
- [x] The thinking content is either surfaced via `reasoning_content` (non-streaming included) or intentionally documented as stripped.
- [x] The chosen behavior (thinking default and recommended `max_tokens`) is documented for `gemma4_unified`.

## Technical Considerations

- The thinking default originates in `set_default_enable_thinking(true)` driven by `infer_thinking_markers().has_thinking()`. Any change should preserve the existing precedence: explicit per-request `chat_template_kwargs.enable_thinking` and the CLI/env defaults (`--chat-template-kwargs`, `LLAMA_ARG_CHAT_TEMPLATE_KWARGS`) must continue to win via `merge_server_and_request`.
- Adding `reasoning_content` to the non-streaming `ChatMessage` would close the gap for every thinking family at once (qwen3 and gemma4), not just Gemma 4.

## Provenance

Observed during validation of #350 / #351 and #333. Not introduced by those changes: the thinking default is intentional and mirrors `ml-explore/mlx-lm#1114`. Bare #350, #351, #333, and #348 are references to this repository.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): gemma-4-12b-it defaults to enable_thinking, yielding verbose thinking and truncated/empty answers at normal max_tokens #352

Problem / Background

Symptom

Secondary observation (verified): reasoning_content is dropped on the non-streaming path

Impact

Directions to investigate (not prescriptive)

Acceptance Criteria

Technical Considerations

Provenance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fix(server): gemma-4-12b-it defaults to enable_thinking, yielding verbose thinking and truncated/empty answers at normal max_tokens #352

Description

Problem / Background

Symptom

Secondary observation (verified): reasoning_content is dropped on the non-streaming path

Impact

Directions to investigate (not prescriptive)

Acceptance Criteria

Technical Considerations

Provenance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions