Tagged "performance-optimization"
- Qwen 3.5 Models: Optimal Settings and Reduced Overthinking Configuration
- Llama.cpp ROCm 7 vs Vulkan Performance Benchmarks on AMD MI50
- How to Build a Self-Hosted AI Server with LM Studio: Step-by-Step Guide
- ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
- Multi-Token Prediction support coming to MLX-LM for Qwen 3.5
- DeepSeek R1 on RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
- Mamba 3: State Space Model Architecture Optimized for Inference
- Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
- A New Magnetic Material for the AI Era
- Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
- Show HN: Merrilin.ai – Code Blocks in Your Books, Finally
- Qwen3.5-397B Achieves 282 tok/s on 4x RTX PRO 6000 Blackwell Through Custom CUTLASS Kernel
- Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
- Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
- Intel OpenVINO Backend Support Now Available in llama.cpp
- Best Local LLM Models 2026: Developer Comparison
- Llama.cpp Adds True Reasoning Budget Support
- Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
- Simple Layer Duplication Technique Achieves Top Open LLM Leaderboard Performance
- 8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- When Running Ollama on Your PC for Local AI, One Thing Matters More Than Most
- Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
- Show HN: Asterode – Multi-Model AI App with Memory and Power Features
- HyperExcel Seeks 150 Billion Won Series B to Scale LPU and Verda in Korea
- Apple Unveils MacBook Pro With M5 Pro and M5 Max for On-Device AI
- Intel Arc Pro B70 Workstation GPU Confirmed via vLLM AI Release Notes
- Apple Neural Engine Reverse-Engineered for Local Model Training on Mac Mini M4
- AMD Expands Ryzen AI 400 Series Portfolio for Consumer and Enterprise AI PC Options
- Switch Qwen 3.5 Thinking Mode On/Off Without Model Reload Using setParamsByID
- Qwen3.5-35B RTX 5080 Experiments Confirm KV q8_0 as Free Lunch, Q4_K_M Remains Optimal
- LLmFit: One-Command Hardware-Aware Model Selection Across 497 Models and 133 Providers
- Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
- Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot
- Building a Privacy-Preserving RAG System in the Browser
- DeepSeek Paper – DualPath: Breaking the Bandwidth Bottleneck in LLM Inference
- The Complete Developer's Guide to Running LLMs Locally: From Ollama to Production
- Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
- Show HN: A Human-Curated, CLI-Driven Context Layer for AI Agents
- What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup
- Kioxia Sampling UFS 5.0 Embedded Flash Memory for Next-Generation Mobile Applications
- Enhanced Interface Speed Enables High-Performance On-Device AI Features in Smartphones
- Open-Source Framework Achieves Gemini 3 Deep Think Level Performance Through Local Model Scaffolding
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Open-Source + AI: ggml Joins Hugging Face, llama.cpp Stays Open—Local AI's Long-Term Home
- The Path to Ubiquitous AI (17k tokens/sec)
- NVIDIA Releases Dynamo v0.9.0: Infrastructure Overhaul With FlashIndexer and Multi-Modal Support
- GPT4All Replaces Ollama On Mac After Quick Trial
- Cloudflare Releases Agents SDK v0.5.0 with Rust-Powered Infire Engine for Edge Inference
- AMD Announces Day 0 Support for Qwen 3.5 LLM on Instinct GPUs
- Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
- Memio Launches AI-Powered Knowledge Hub for Android with Local Processing
- NAS System Achieves 18 tok/s with 80B LLM Using Only Integrated Graphics
- Developer Switches from Ollama and LM Studio to llama.cpp for Better Performance
- Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data