preserve_thinking vs enable_thinking: Investigation Findings
Date: 2026-05-24
Ticket: gpumod-qgy
Decision: Hermes-agent (gpumod-aop) will use preserve_thinking:true in production. Benchmarked numbers from result_qwen36-35b-a3b-mtp-iq4xs.json (enable_thinking:true) carry over.
Question
Does --chat-template-kwargs '{"preserve_thinking":true}' change Qwen3.6 MTP behavior vs --chat-template-kwargs '{"enable_thinking":true}' on a single-shot prompt? The 76l.3 benchmark used enable_thinking:true; we want to know whether the just-measured numbers carry over if production swaps to preserve_thinking:true for multi-turn Hermes-agent workflows.
Method
- Extract the Qwen3.6 chat template embedded in the MTP GGUF (tokenizer.chat_template metadata key).
- Trace where each kwarg is consumed in the template's Jinja flow.
- Verify llama.cpp's C++ side does not separately interpret either flag.
- Cross-check Unsloth + HuggingFace docs for any documented MTP × thinking-flag interaction.
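The first step in the list above can be sketched with the standard library alone. This is a minimal illustration, not the /tmp/extract_template.py script referenced in this note (its contents are not shown here). It assumes a little-endian GGUF v2 or v3 file, and the function name read_metadata_key is hypothetical.

```python
import io
import struct

# GGUF scalar value-type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_metadata_key(f, wanted):
    """Return the metadata value stored under `wanted`, or None if absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("this sketch assumes GGUF v2+ (64-bit counts)")
    _read(f, "<Q")             # tensor count, unused here
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == wanted:
            return value
    return None
```

Usage would look like `with open(path, "rb") as f: print(read_metadata_key(f, "tokenizer.chat_template"))`. Every preceding key has to be parsed (including large token arrays) because GGUF metadata has no index, so the scan can take a few seconds on a full model file.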
Findings
1. Chat template proof
Extracted via direct GGUF parse from ~/bin/Qwen3.6-35B-A3B-MTP-UD-IQ4_XS.gguf. The two relevant branches:
Per-message loop (only fires for PRIOR assistant turns):
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- set reasoning_content = reasoning_content|trim %}
{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
The preserve_thinking is true test sits inside the assistant branch, which runs only for assistant messages already present in the conversation. On a single-shot prompt (messages = [user]), the loop never enters this branch.
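To make the reachability argument concrete, the branch above can be mirrored in plain Python. This is a hand-written model of the excerpt, not the template itself. It assumes that content holds the message's string content, and it takes last_query_index as a parameter because the excerpt does not show how ns.last_query_index is computed. The function names are hypothetical.

```python
def render_assistant(message, index, last_query_index, preserve_thinking=None):
    """Mirror of the excerpted assistant branch for a single message."""
    content = message["content"]
    reasoning = ""
    if isinstance(message.get("reasoning_content"), str):
        reasoning = message["reasoning_content"]
    elif "</think>" in content:
        # Same split chain as the template: text between <think> and </think>.
        reasoning = (content.split("</think>")[0].rstrip("\n")
                     .split("<think>")[-1].lstrip("\n"))
        content = content.split("</think>")[-1].lstrip("\n")
    reasoning = reasoning.strip()
    if preserve_thinking is True or index > last_query_index:
        return ("<|im_start|>assistant\n<think>\n" + reasoning
                + "\n</think>\n\n" + content)
    return "<|im_start|>assistant\n" + content


def render_assistant_turns(messages, last_query_index, preserve_thinking=None):
    """Apply the branch to every assistant message; other roles are skipped."""
    return [render_assistant(m, i, last_query_index, preserve_thinking)
            for i, m in enumerate(messages) if m["role"] == "assistant"]


single_shot = [{"role": "user", "content": "hi"}]
# No assistant message exists, so the flag is never consulted.
print(render_assistant_turns(single_shot, 0, preserve_thinking=True))  # []
```

With messages = [user] the list comprehension yields nothing regardless of preserve_thinking, which is the single-shot case described above.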
Generation prompt block (always emitted, governed only by enable_thinking):
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- else %}
{{- '<think>\n' }}
{%- endif %}
{%- endif %}
preserve_thinking is not referenced here. Whether the model is told to think on the new turn is decided by enable_thinking alone (defaults to "yes" when undefined).
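The generation prompt block is small enough to restate as a Python function. This is a sketch of the excerpt's logic only; the function name generation_prompt is hypothetical, and template kwargs are modelled as Python keyword arguments.

```python
def generation_prompt(add_generation_prompt=True, **template_kwargs):
    """Model of the excerpted block: only enable_thinking is inspected."""
    if not add_generation_prompt:
        return ""
    out = "<|im_start|>assistant\n"
    # Jinja: `enable_thinking is defined and enable_thinking is false`
    if ("enable_thinking" in template_kwargs
            and template_kwargs["enable_thinking"] is False):
        out += "<think>\n\n</think>\n\n"   # closed, empty think block
    else:
        out += "<think>\n"                 # open think block: model reasons
    return out


# Both benchmark-time and production kwargs yield the same suffix.
assert generation_prompt(enable_thinking=True) == generation_prompt(preserve_thinking=True)
```

Passing only preserve_thinking leaves enable_thinking undefined, so the else arm runs and the open think tag is emitted, the same as with enable_thinking set to true.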
2. C++ source confirms no inference-layer special-casing
In llama.cpp/common/chat.h:
- enable_thinking is a typed field on common_chat_templates_inputs (line 190)
- chat_template_kwargs is an opaque std::map<std::string, std::string> (line 192) — passes through to the Jinja renderer untouched
A repository-wide grep for preserve_thinking across the C++ tree returns zero hits. The flag exists only inside the template — there is no inference-layer or speculative-decoding-layer code that branches on it.
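The zero-hit check can be reproduced without shell tooling. The sketch below is one possible equivalent of the grep, not the command that was actually run. The helper name find_hits and the list of file suffixes are assumptions.

```python
from pathlib import Path


def find_hits(root, needle, suffixes=(".h", ".hpp", ".c", ".cc", ".cpp")):
    """Return (path, line number, line) for each source line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Run against a llama.cpp checkout, `find_hits("llama.cpp", "preserve_thinking")` should return an empty list if the finding above holds for that revision, while searching for enable_thinking should list the typed field in common/chat.h.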
3. Documentation review (no contradicting evidence)
- Unsloth Qwen3.6 docs: "preserve_thinking leaves the thinking trace from the previous conversation… increases the number of tokens you use, but could increase accuracy in continued conversations." Qualified language; no measured numbers; no single-shot effect mentioned.
- HuggingFace model card (unsloth/Qwen3.6-35B-A3B-MTP-GGUF): no recommended chat-template-kwargs for MTP; no mention of preserve_thinking; no MTP × thinking-flag interaction documented.
Conclusion
For a single-shot prompt with the Qwen3.6-MTP chat template:
| Aspect | enable_thinking:true | preserve_thinking:true |
|---|---|---|
| User-message rendering | identical | identical |
| Per-message loop branches taken | none (no prior assistant) | none (no prior assistant) |
| Generation prompt suffix | <\|im_start\|>assistant\n<think>\n | <\|im_start\|>assistant\n<think>\n |
| MTP draft head input | identical | identical |
| MTP draft head behavior | identical (no code branches on either flag) | identical |
| Expected output distribution | same | same (modulo sampling stochasticity) |
Single-shot equivalence is proven by construction, not by empirical measurement.
The two flags diverge only on continuation turns where a prior assistant message carries reasoning_content:
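- With enable_thinking:true alone, the prior <think> block is dropped from the templated history for assistant turns at or before ns.last_query_index; later assistant turns still keep it through the loop.index0 > ns.last_query_index clause
- With preserve_thinking:true, the prior <think> block is included verbatim in the templated history for every assistant turn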
This is exactly the multi-turn benefit Unsloth documents ("could increase accuracy in continued conversations").
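The single-shot equivalence and the multi-turn divergence can both be checked with a toy end-to-end renderer. This is a simplified model under stated assumptions, not the real template: user-message rendering and the <|im_end|> terminators are not part of the excerpts above and are assumed here, last_query_index is assumed to be the index of the last user message, and reasoning_content is assumed to be supplied as a string field.

```python
def render(messages, **template_kwargs):
    """Toy model of the template: history loop plus generation prompt."""
    # Assumption: last_query_index is the index of the last user message.
    last_query_index = max(i for i, m in enumerate(messages)
                           if m["role"] == "user")
    out = []
    for i, m in enumerate(messages):
        body = m["content"]
        if m["role"] == "assistant":
            keep = (template_kwargs.get("preserve_thinking") is True
                    or i > last_query_index)
            if keep:
                reasoning = m.get("reasoning_content", "").strip()
                body = "<think>\n" + reasoning + "\n</think>\n\n" + body
        # Assumption: every turn is closed with <|im_end|>.
        out.append("<|im_start|>" + m["role"] + "\n" + body + "<|im_end|>\n")
    out.append("<|im_start|>assistant\n")
    disabled = template_kwargs.get("enable_thinking") is False
    out.append("<think>\n\n</think>\n\n" if disabled else "<think>\n")
    return "".join(out)


single = [{"role": "user", "content": "What is 2+2?"}]
multi = single + [
    {"role": "assistant", "content": "4", "reasoning_content": "2+2 is 4"},
    {"role": "user", "content": "And 3+3?"},
]

# Single-shot: byte-identical prompts under either flag.
assert render(single, enable_thinking=True) == render(single, preserve_thinking=True)
# Multi-turn: only preserve_thinking keeps the earlier reasoning.
assert "2+2 is 4" not in render(multi, enable_thinking=True)
assert "2+2 is 4" in render(multi, preserve_thinking=True)
```

The prompts differ only once an assistant turn with reasoning sits before the last user query, which matches the two bullets above.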
Decision for gpumod-aop
Hermes-agent (chat + tool-calling) is multi-turn by nature. The benchmarked numbers from result_qwen36-35b-a3b-mtp-iq4xs.json carry over unchanged to production use with preserve_thinking:true for single-shot prompts. Dropping enable_thinking from the kwargs does not disable thinking, because the generation prompt opens <think> whenever enable_thinking is undefined. The swap therefore lands with the agent-appropriate flag rather than the benchmark-time flag, with no expected regression in single-shot quality.
Concretely, presets/llm/qwen36-35b-a3b-mtp-iq4xs.yaml gets its extra_args updated:
- extra_args: "--parallel 1 --threads 16 --spec-type draft-mtp --spec-draft-n-max 2 --chat-template-kwargs '{\"enable_thinking\":true}'"
+ extra_args: "--parallel 1 --threads 16 --spec-type draft-mtp --spec-draft-n-max 2 --chat-template-kwargs '{\"preserve_thinking\":true}'"
The change is captured under gpumod-aop, not under this spike.
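Hand-escaping JSON inside a shell-quoted string inside YAML is easy to get wrong, so it can help to generate and verify the argument string programmatically. The sketch below assumes extra_args is later split with shell-style quoting rules, which this note does not confirm. The helper name build_extra_args is hypothetical.

```python
import json
import shlex


def build_extra_args(template_kwargs):
    """Build the extra_args string with a shell-quoted JSON kwargs payload."""
    base = ["--parallel", "1", "--threads", "16",
            "--spec-type", "draft-mtp", "--spec-draft-n-max", "2"]
    # Compact separators keep the JSON free of spaces, matching the preset.
    kwargs_json = json.dumps(template_kwargs, separators=(",", ":"))
    return " ".join(base + ["--chat-template-kwargs", shlex.quote(kwargs_json)])


args = build_extra_args({"preserve_thinking": True})
print(args)

# Round trip: shell-style splitting must recover valid JSON.
tokens = shlex.split(args)
assert json.loads(tokens[-1]) == {"preserve_thinking": True}
```

The printed string is what the YAML value should decode to once the \" escapes are resolved, so it can be compared against the preset after editing.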
What was NOT proven
This investigation only proves single-shot equivalence. The actual value of preserve_thinking:true (accuracy lift on multi-turn agent traces) was not measured — the 20260524 benchmark's single-shot methodology cannot speak to that. If a future ticket wants to quantify multi-turn benefit, it needs a multi-turn evaluation harness, not the v2 coding benchmark.
Smoke check (runtime confirmation)
The template proof handles correctness. A 1-iteration smoke run is the only remaining defensive check: it confirms that the service starts cleanly with the flag in extra_args and serves a valid response. See gpumod-qgy notes for the smoke runbook.
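As an illustration of what such a check could look like (this is not the gpumod-qgy runbook), the sketch below sends one request to llama-server's OpenAI-compatible /v1/chat/completions endpoint. The base URL, prompt, max_tokens value, and timeout are assumptions, as is the choice to accept text in either content or reasoning_content.

```python
import json
import urllib.request


def build_payload(prompt="Reply with the single word: ok"):
    """Single-shot chat request body (hypothetical prompt and token budget)."""
    return {"messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256}


def is_valid_response(body):
    """True if the first choice carries any non-empty text."""
    try:
        message = body["choices"][0]["message"]
    except (KeyError, IndexError, TypeError):
        return False
    if not isinstance(message, dict):
        return False
    # Assumption: thinking output may arrive in a separate reasoning_content field.
    text = (message.get("content") or "") + (message.get("reasoning_content") or "")
    return bool(text.strip())


def smoke(base_url="http://127.0.0.1:8080"):
    """Send one request; raises on connection or HTTP errors."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return is_valid_response(json.load(response))
```

Calling `smoke()` after the service is up returns True when a non-empty reply comes back; a startup failure caused by malformed --chat-template-kwargs would surface earlier as a connection error.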
References
- Unsloth Qwen3.6 docs — Thinking, Disable & Preserve Thinking
- unsloth/Qwen3.6-35B-A3B-MTP-GGUF model card
- llama.cpp source: common/chat.h:190-192, common/chat.cpp:548-553
- Chat template extraction: /tmp/extract_template.py against the MTP GGUF
- Baseline benchmark: 20260524 Qwen3.6 MTP vs Non-MTP