Running Mistral-7B on Intel NPU Achieves 12.6 Tokens/Second

1 min read

A developer has created a tool that runs LLMs on Intel's Neural Processing Unit (NPU), demonstrating practical local inference on specialized AI hardware. In tests with Mistral-7B quantized to int4 on a Core Ultra processor, the NPU achieved a decode speed of 12.6 tokens/second with a 1.8s time-to-first-token (TTFT), while consuming 4.8GB of memory and, crucially, no CPU or GPU resources.
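
The post does not include the developer's code, but for a rough idea of what NPU offload looks like in practice, here is a minimal sketch using OpenVINO GenAI, one common route to Intel NPU inference; the model directory, prompt, and the int4 export step are placeholder assumptions for illustration, not the author's actual tool.

```python
# Minimal sketch (not the author's tool): run an int4-quantized LLM on the
# Intel NPU via OpenVINO GenAI. Assumes the model was exported beforehand, e.g.:
#   optimum-cli export openvino --model mistralai/Mistral-7B-Instruct-v0.2 \
#       --weight-format int4 mistral-7b-int4-ov
import openvino_genai as ov_genai

# "NPU" selects the Neural Processing Unit; "CPU" or "GPU" would target the
# other devices used in the comparative benchmarks below.
pipe = ov_genai.LLMPipeline("mistral-7b-int4-ov", "NPU")

# Generate a short completion; decoding runs on the NPU rather than CPU/GPU.
print(pipe.generate("Explain what an NPU is in one sentence.", max_new_tokens=64))
```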

While the NPU's performance is respectable, comparative benchmarks show the integrated GPU still leads at 23.38 tokens/second with a much faster 0.25s TTFT, and even CPU inference manages 9.04 tokens/second. The NPU's key advantage, however, lies in freeing up CPU and GPU resources for other tasks while maintaining decent inference speeds.
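
For context on how these two figures are typically measured, TTFT is the delay between submitting the prompt and receiving the first generated token, while decode speed counts tokens produced per second after that point. A library-agnostic sketch of the measurement, with a purely hypothetical stream_tokens() generator standing in for whatever streaming API the benchmark used:

```python
import time

def stream_tokens(prompt):
    # Hypothetical stand-in for a streaming generation API; yields tokens
    # one at a time as the model produces them.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)
        yield tok

def measure(prompt):
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()  # first token arrived
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    # Decode speed is computed over the tokens generated after the first one.
    decode_tps = (n_tokens - 1) / (end - first_token_time) if n_tokens > 1 else 0.0
    return ttft, decode_tps

ttft, tps = measure("Explain what an NPU is.")
print(f"TTFT: {ttft:.2f}s, decode speed: {tps:.2f} tokens/s")
```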

This development is significant for edge deployment scenarios where power efficiency and resource allocation matter more than raw performance. The ability to offload LLM inference to dedicated NPU hardware while keeping CPU and GPU available for other workloads opens new possibilities for local LLM deployment architectures, particularly in multi-tasking environments or battery-powered devices.

Source: r/LocalLLaMA · Relevance: 8/10