Tagged "performance-optimization"
- Qwen 3.5 Models: Optimal Settings and Reduced Overthinking Configuration
- Llama.cpp ROCm 7 vs Vulkan Performance Benchmarks on AMD MI50
- How to Build a Self-Hosted AI Server with LM Studio: Step-by-Step Guide
- ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
- Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
- DeepSeek R1 on RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
- Mamba 3: State Space Model Architecture Optimized for Inference
- Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
- A New Magnetic Material for the AI Era
- Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
- Show HN: Merrilin.ai – Code Blocks in Your Books, Finally
- Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
- Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
- Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
- Intel OpenVINO Backend Support Now Available in llama.cpp
- Best Local LLM Models 2026: Developer Comparison
- Llama.cpp Adds True Reasoning Budget Support
- Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
- Simple Layer Duplication Technique Achieves Top Open LLM Leaderboard Performance
- 8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- When Running Ollama on Your PC for Local AI, One Thing Matters More Than Most
- Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
- Show HN: Asterode – Multi-Model AI App with Memory and Power Features
- HyperExcel Seeks 150 Billion Won Series B to Scale LPU and Verda in Korea
- Apple Unveils MacBook Pro With M5 Pro and M5 Max for On-Device AI
- Intel Arc Pro B70 Workstation GPU Confirmed via vLLM AI Release Notes
- Apple Neural Engine Reverse-Engineered for Local Model Training on Mac Mini M4
- AMD Expands Ryzen AI 400 Series Portfolio for Consumer and Enterprise AI PC Options
- Switch Qwen 3.5 Thinking Mode On/Off Without Model Reload Using setParamsByID
- Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
- LLmFit: One-Command Hardware-Aware Model Selection Across 497 Models and 133 Providers
- Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
- Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot
- Building a Privacy-Preserving RAG System in the Browser
- DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference
- The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
- Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
- Show HN: A Human-Curated, CLI-Driven Context Layer for AI Agents
- What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup
- Kioxia Sampling UFS 5.0 Embedded Flash Memory for Next-Generation Mobile Applications
- Enhanced Interface Speed Enables High-Performance On-Device AI Features in Smartphones
- Open-Source Framework Achieves Gemini 3 Deep Think Level Performance Through Local Model Scaffolding
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Open-Source + AI: ggml Joins Hugging Face, llama.cpp Stays Open—Local AI's Long-Term Home
- The Path to Ubiquitous AI (17k tokens/sec)
- NVIDIA Releases Dynamo v0.9.0: Infrastructure Overhaul With FlashIndexer and Multi-Modal Support
- GPT4All Replaces Ollama On Mac After Quick Trial
- Cloudflare Releases Agents SDK v0.5.0 with Rust-Powered Infire Engine for Edge Inference
- AMD Announces Day 0 Support for Qwen 3.5 LLM on Instinct GPUs
- Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
- Memio Launches AI-Powered Knowledge Hub for Android with Local Processing
- NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics
- Developer Switches from Ollama and LM Studio to llama.cpp for Better Performance
- Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data