Tagged "inference-speed"
- Show HN: We built an OCR server that can process 270 dense images/s on a 5090
- llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
- Llama.cpp Robot Wars
- Unweight: Lossless MLP Weight Compression for LLM Inference
- Sorting 1M u64 KV-Pairs in 20ms on an i9-13980HX Using a Branchless Rust Implementation
- DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max
- Fine-Tuned Qwen3.5-0.8B for OCR Outperforms Previous 2B Release
- oMLX Framework Implements DFlash Attention for Optimized Inference
- Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
- On-Device AI: Achieving Powerful AI Capabilities Without Internet Connectivity
- Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference
- DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
- The Best Local AI Model for Home Assistant Isn't Always the Biggest One
- Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
- Google's Gemini Nano 4 Offers Faster, Smarter Local Inference Capabilities
- DMax: New Parallel Decoding Paradigm for Diffusion Language Models
- Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs
- Speculative Decoding Made My Local LLM Actually Usable
- TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
- Gemma 4 26B Achieves Impressive Local Performance With Proper Configuration
- TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
- HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware
- Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups
- GMKtec NucBox K17 Launches with 97 TOPS AI Performance for Local Inference
- Mixed Precision Quantization on MLX with TurboQuant Implementation
- Kokoro TTS Achieves 20× Realtime Speed on CPU-Only On-Device Inference
- OpenUMA – Apple-Style Unified Memory for x86 AI Inference
- NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs
- Google Gemma 4 Released with GGUF Quantizations
- Gemma 4 26B A4B Outperforms Qwen 3.5 35B on Apple Silicon
- Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support
- TinyGPU Adds Mac Support for External Nvidia GPU Acceleration
- Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac
- Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
- TurboQuant: Understanding the Quantization Breakthrough
- Linux Significantly Outperforms Windows for Local LLM Inference
- TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
- M5 Max Delivers 1.7x Faster Inference Than M3 Max on Qwen 3.5 Models
- TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
- RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU
- Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
- Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared
- Critical: LiteLLM Supply Chain Attack Detected, Bifrost Alternative Released