Tagged "performance"

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference 27 May 2026
vLLM vs Ollama 2026: Performance Benchmark Reveals 9x Throughput Gap 25 May 2026
Users Report Superior Performance Switching from LM Studio to llama.cpp 25 May 2026
110 Tokens/Second on RTX 4070 Super with Qwen 3.6 35B 22 May 2026
llama.cpp Checkpoint Fix Accelerates Local Coding Agents 22 May 2026
llama.cpp Adds Multi-Token Prediction, Doubles Qwen 3.6B Throughput for Local Inference 19 May 2026
Bito's AI Architect Improves Claude Opus Task Success Rate by 35% 19 May 2026
Orthrus Reshapes Economics of Local AI Inference with New Optimization Approach 16 May 2026
ROCm 7.2.3 Delivers Performance Improvements Over 7.0.0 on AMD Radeon AI PRO 15 May 2026
Open-Source Local LLM Emerges as Viable Cloud AI Competitor 15 May 2026
Lucebox Brings Faster Local AI Inference to AMD Strix Halo 13 May 2026
Lython: Experimental Python Compiler Toolchain Based on LLVM 11 May 2026
One LM Studio Setting Change Makes Local LLMs Competitive With Cloud Models 11 May 2026
DFlash Speculative Decoding Delivers 8.5x Speed Improvement for LLM Inference 11 May 2026
Bun's Experimental Rust Rewrite Achieves 99.8% Test Compatibility on Linux 9 May 2026
Microsoft VibeVoice C++ Port Enables Local Voice AI on CPU and GPU Without Python 6 May 2026
Sarvam Edge: Indian-Built AI Models Run Offline on Phones and Laptops Without Internet 6 May 2026
Google Accelerates Gemma 4 Inference Speed 3x With Multi-Token Prediction Drafters 6 May 2026
llama.cpp Now Supports Multi-Token Prediction in Beta 5 May 2026
NIST's CAISI Evaluation of DeepSeek V4 Pro Finds It On Par with GPT-5 3 May 2026
Linux Setup for Local LLMs Takes Minutes Compared to Windows Hours 1 May 2026
Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp 28 April 2026
Show HN: We built an OCR server that can process 270 dense images/s on a 5090 23 April 2026
llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost 20 April 2026
Gemma 4 Just Replaced My Whole Local LLM Stack 19 April 2026
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation 18 April 2026
The 'Ollama' Tool Has Numerous Problems, and Some Argue That Llama.cpp Is Better 17 April 2026
Intel's $949 GPU Has 32GB of VRAM for Local AI, but the Software Is Why Nvidia Keeps Winning 17 April 2026
SigMap – Shrink AI Coding Context 97% with Auto-Scaling Token Budget 15 April 2026
DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max 15 April 2026
Running Same Prompts Through Claude and Local LLM Revealed Unexpected Results 13 April 2026
Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference 12 April 2026
DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon 12 April 2026
Speculative Decoding Made My Local LLM Actually Usable 9 April 2026
Your Next Assistant is Your PC: How On-Device AI is Transforming Work, One Workflow at a Time 7 April 2026
MemPalace, the Highest-Scoring AI Memory System Ever Benchmarked 7 April 2026
CricketBrain: Neuromorphic Signal Processor in Rust (0.175us/step, 944 bytes) 7 April 2026
Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups 5 April 2026
Microsoft Quantum Development Kit Ported to Rust: 100x Faster and Smaller 5 April 2026
Gemma 4 31B Outperforms GLM 5.1 in Real-World Testing 4 April 2026
TurboQuant: Understanding the Quantization Breakthrough 29 March 2026
Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference 29 March 2026
Mixed KV Cache Quantization: Performance Risks and Pitfalls 29 March 2026
Linux Significantly Outperforms Windows for Local LLM Inference 29 March 2026
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context 28 March 2026
RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra 27 March 2026
Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config 27 March 2026
Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU 26 March 2026
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching 26 March 2026
Google TurboQuant: Extreme Compression for Local LLM Deployment 25 March 2026
Rust Project Perspectives on AI 22 March 2026
Multi-Token Prediction support coming to MLX-LM for Qwen 3.5 21 March 2026
Snapdragon 8 Elite Gen 5 Hands the Galaxy S26 the AI Upgrade We've Been Waiting For 18 March 2026
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM 14 March 2026
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems 14 March 2026
3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens 14 March 2026
Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs 12 March 2026
Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia 12 March 2026
FreeBSD 14.4 Released: Implications for Local LLM Deployment 10 March 2026
Mojo: Creating a Programming Language for an AI World with Chris Lattner 7 March 2026
The Emerging Role of SRAM-Centric Chips in AI Inference 6 March 2026
Apple M5 Pro and M5 Max: 4× Faster LLM Processing 4 March 2026
Qwen 3.5 vs Qwen 3 Benchmark Analysis: Generational Performance Improvements Visualized 3 March 2026
Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot 28 February 2026
Snapdragon 8 Elite Gen 5 Powers Galaxy S26 Series With Enhanced On-Device AI 27 February 2026
Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti 26 February 2026
Qwen 3.5 Underperforms on Hard Coding Tasks—APEX Benchmark Analysis 26 February 2026
Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup 26 February 2026
New Era of On-Device AI Driven by High-Speed UFS 5.0 Storage 25 February 2026
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference 23 February 2026
Taalas Etches AI Models onto Transistors to Rocket Boost Inference 21 February 2026
I Thought I Needed a GPU to Run AI Until I Learned About These Models 21 February 2026
24 Simultaneous Claude Code Agents on Local Hardware 21 February 2026