Critical: Qwen 3.5 Requires BF16 KV Cache, Not FP16 for Accurate Inference
A technical discovery highlights an important compatibility issue for local Qwen 3.5 deployment. Daniel Han has documented that the Qwen 3.5 35B model requires bfloat16 KV cache precision rather than the float16 that inference engines like llama.cpp use by default. Users need to explicitly pass the `-ctk bf16 -ctv bf16` flags to maintain model accuracy.
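In practice this means adding the two KV-cache-type flags to your llama.cpp invocation. A minimal sketch, assuming a local GGUF file (the model path and prompt are placeholders, not from the source):

```shell
# Force the KV cache to bfloat16 instead of llama.cpp's default f16.
# -ctk / -ctv are short forms of --cache-type-k / --cache-type-v.
./llama-cli \
  -m ./qwen3.5-35b.gguf \
  -ctk bf16 -ctv bf16 \
  -p "Hello, world"
```

Note that bf16 KV cache support depends on the backend; on hardware without native bfloat16 support the flags may be unavailable or slow.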
The verification was rigorous: perplexity measurements on WikiText-2-raw demonstrated the impact of the incorrect KV cache precision, with the author deliberately using perplexity rather than KL divergence metrics to keep the results easy to reproduce. This matters because the KV cache is the memory bottleneck in long-context inference, and since float16 and bfloat16 are both 16-bit formats, switching to bf16 costs no additional memory; using the wrong one simply degrades output quality for free.
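The perplexity comparison can be reproduced with llama.cpp's bundled perplexity tool. A sketch under assumed paths (the dataset file location and model filename are placeholders; the source does not give exact commands):

```shell
# Run with bf16 KV cache, then repeat with f16 to compare perplexity.
# Lower perplexity on WikiText-2-raw indicates less quality degradation.
./llama-perplexity \
  -m ./qwen3.5-35b.gguf \
  -ctk bf16 -ctv bf16 \
  -f ./wikitext-2-raw/wiki.test.raw
```

Rerunning with `-ctk f16 -ctv f16` and comparing the two perplexity scores shows the accuracy gap the author measured.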
For practitioners already running Qwen 3.5 models locally, this is a critical configuration requirement. The discovery also underscores the value of community knowledge-sharing around model-specific optimization details that aren't always documented in official releases.
Source: r/LocalLLaMA · Relevance: 8/10