Tagged "performance"
- Hipfire: A Rust-Native AMD Inference Engine That Outperforms llama.cpp
- Show HN: We built an OCR server that can process 270 dense images/s on a 5090
- llama.cpp Merges Speculative Checkpointing for Major Inference Speed Boost
- Gemma 4 Just Replaced My Whole Local LLM Stack
- Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
- The 'Ollama' Tool Has Numerous Problems, and Some Argue That Llama.cpp Is Better
- Intel's $949 GPU Has 32GB of VRAM for Local AI, but the Software Is Why Nvidia Keeps Winning
- SigMap – Shrink AI Coding Context 97% with Auto-Scaling Token Budget
- DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max
- Running Same Prompts Through Claude and Local LLM Revealed Unexpected Results
- Google Gemma 4 Delivers Exceptional Speed and Accuracy for Local Inference
- DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
- Speculative Decoding Made My Local LLM Actually Usable
- Your Next Assistant is Your PC: How On-Device AI is Transforming Work, One Workflow at a Time
- MemPalace, the Highest-Scoring AI Memory System Ever Benchmarked
- CricketBrain: Neuromorphic Signal Processor in Rust (0.175us/step, 944 bytes)
- Ollama Gets Blazing Fast on Macs with Full MLX Support and 2× Speedups
- Microsoft Quantum Development Kit Ported to Rust: 100x Faster and Smaller
- Gemma 4 31B Outperforms GLM 5.1 in Real-World Testing
- TurboQuant: Understanding the Quantization Breakthrough
- Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference
- Mixed KV Cache Quantization: Performance Risks and Pitfalls
- Linux Significantly Outperforms Windows for Local LLM Inference
- TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
- RotorQuant: 10-19x Faster Quantisation Alternative Using Clifford Algebra
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- Liquid AI's LFM2-24B Achieves 50 Tokens/Second in Web Browser via WebGPU
- Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
- Google TurboQuant: Extreme Compression for Local LLM Deployment
- Rust Project Perspectives on AI
- Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
- Snapdragon 8 Elite Gen 5 Hands the Galaxy S26 the AI Upgrade We've Been Waiting For
- P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
- Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
- 3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
- Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
- Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- Mojo: Creating a Programming Language for an AI World with Chris Lattner
- The Emerging Role of SRAM-Centric Chips in AI Inference
- Apple M5 Pro and M5 Max: 4× Faster LLM Processing
- Qwen 3.5 vs Qwen 3 Benchmark Analysis: Generational Performance Improvements Visualized
- Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot
- Snapdragon 8 Elite Gen 5 Powers Galaxy S26 Series With Enhanced On-Device AI
- Qwen 3.5 MoE Delivers 100K Context Window at 40+ TPS on RTX 5060 Ti
- Qwen 3.5 Underperforms on Hard Coding Tasks—APEX Benchmark Analysis
- Qwen3.5 122B Achieves 25 tok/s on 72GB VRAM Setup
- New Era of On-Device AI Driven by High-Speed UFS 5.0 Storage
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Taalas Etches AI Models onto Transistors to Rocket Boost Inference
- I Thought I Needed a GPU to Run AI Until I Learned About These Models
- 24 Simultaneous Claude Code Agents on Local Hardware