Tagged "large-model-inference"
- Dynamic Expert Cache in llama.cpp Achieves 27% Faster Inference on Large MoE Models
- Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs
- Ollama Gains Full MLX Support on Macs, Delivering 2× Speedups
- TinyGPU Adds Mac Support for External Nvidia GPU Acceleration
- Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning
- Qwen3.5-397B Achieves 282 tok/s on 4× RTX PRO 6000 Blackwell GPUs Through a Custom CUTLASS Kernel
- Apple M5 Max 128GB Benchmark Results for Local LLM Inference
- M5 Max and M5 Ultra Chips Demonstrate Significant Bandwidth Improvements for Local LLM Inference
- Community Member Builds 144GB VRAM Local LLM Powerhouse