DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon

1 min read

A developer has implemented DFlash speculative decoding natively in MLX for Apple Silicon, achieving a 3.3x throughput improvement on Qwen 3.5-9B models running on M5 Max hardware. The implementation uses a small draft model to generate 16 tokens in parallel via block diffusion, which a target model then verifies in a single forward pass, with output remaining bit-for-bit identical to baseline greedy decoding.
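For intuition, here is a minimal, framework-agnostic sketch of the greedy draft-and-verify step described above. The helper names (`verify_drafted_block`, `target_logits`, `argmax`) are hypothetical stand-ins, not the DFlash or MLX source; it only illustrates why the output stays identical to plain greedy decoding.

```python
# Minimal sketch of greedy draft-and-verify speculative decoding
# (hypothetical helpers; not the DFlash/MLX implementation).

def argmax(row):
    """Index of the largest logit in a single vocabulary-sized row."""
    return max(range(len(row)), key=row.__getitem__)

def verify_drafted_block(prefix, drafted, target_logits):
    """Score prefix + drafted block with one target forward pass and accept the
    longest run of drafted tokens that matches the target's greedy choices, so
    the final output is bit-for-bit identical to plain greedy decoding."""
    logits = target_logits(prefix + drafted)       # assumed shape: [seq_len][vocab]
    accepted = []
    for i, tok in enumerate(drafted):
        # Logits at position len(prefix) - 1 + i predict the token at len(prefix) + i.
        choice = argmax(logits[len(prefix) - 1 + i])
        accepted.append(choice)
        if choice != tok:
            break                                  # first mismatch: keep the target's own token and stop
    return accepted
```

With a 16-token draft block, each target forward pass can emit up to 16 tokens (or the accepted prefix plus one correction) instead of a single token, which is where the reported throughput gain comes from.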

Achieving 85 tokens per second on consumer-grade Apple Silicon marks a significant step for local deployment on MacBook Pro and Mac Studio systems. The technique is purely algorithmic and requires no specialized hardware, making it broadly applicable across existing MLX deployments. It opens new possibilities for real-time inference on Apple's ecosystem without requiring larger or more power-hungry machines.

The speculative decoding methodology belongs to a growing class of inference optimizations that preserve output fidelity while dramatically reducing latency, making interactive local inference more practical on resource-constrained devices.


Source: r/LocalLLaMA · Relevance: 9/10