Tagged "inference-speed"
- Show HN: We built an OCR server that can process 270 dense images/s on a 5090
- llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
- Llama.cpp Robot Wars
- Unweight: Lossless MLP Weight Compression for LLM Inference
- Sorting 1M u64 KV-Pairs in 20ms on an i9-13980HX Using a Branchless Rust Implementation
- DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max
- Fine-Tuned Qwen3.5-0.8B for OCR Outperforms Previous 2B Release
- oMLX Framework Implements DFlash Attention for Optimized Inference
- Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
- On-Device AI: Achieving Powerful AI Capabilities Without Internet Connectivity
- Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference
- DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
- The Best Local AI Model for Home Assistant Isn't Always the Biggest One
- Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
- Google's Gemini Nano 4 Offers Faster, Smarter Local Inference Capabilities
- DMax: New Parallel Decoding Paradigm for Diffusion Language Models
- Qwen 3.5 122B Achieves 198 Tokens/sec on Dual RTX PRO 6000 Blackwell GPUs
- Speculative Decoding Made My Local LLM Actually Usable
- TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
- Gemma 4 26B Achieves Impressive Local Performance With Proper Configuration
- TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
- HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware
- Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups
- GMKtec NucBox K17 Launches with 97 TOPS AI Performance for Local Inference
- Mixed Precision Quantization on MLX with TurboQuant Implementation
- Kokoro TTS Achieves 20× Realtime Speed on CPU-Only On-Device Inference
- OpenUMA – Apple-Style Unified Memory for x86 AI Inference
- NVIDIA Accelerates Gemma 4 for Local Agentic AI on RTX GPUs
- Google Gemma 4 Released with GGUF Quantizations
- Gemma 4 26B A4B Outperforms Qwen 3.5 35B on Apple Silicon
- Apple Silicon Macs Run Local AI Faster with Ollama's New MLX Support
- TinyGPU Adds Mac Support for External Nvidia GPU Acceleration
- Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac
- Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
- TurboQuant: Understanding the Quantization Breakthrough
- Linux Significantly Outperforms Windows for Local LLM Inference
- TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
- M5 Max Delivers 1.7x Faster Inference Than M3 Max on Qwen 3.5 Models
- TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
- RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU
- Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
- Llama.cpp Benchmark: RTX 5090 vs Enterprise Systems Compared
- Critical: LiteLLM Supply Chain Attack Detected, Bifrost Alternative Released