Tagged "llama-cpp"
- I built Rubric, an open source Sentry for AI. Looking for beta testers
- LM Studio Releases Reworked Plugins with Fully Local Web Research
- Llama.cpp ROCm 7 vs Vulkan Performance Benchmarks on AMD Mi50
- Rust Project Perspectives on AI
- Setting Up a Private AI Brain on Windows: Complete Guide to Local LLM Deployment
- ik_llama.cpp Fork Delivers 26x Faster Prompt Processing on Qwen 3.5 27B
- Careless Whisper – Personal Local Speech to Text
- Automating Read-It-Later Workflows with Local LLMs for Overnight Summarization
- Qualcomm and Samsung's 30-Year AI Alliance Enters a New Phase as On-Device AI Chip Race Heats Up
- What AI Augmentation Means for Technical Leaders
- Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
- LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
- Kilo Is the VS Code Extension That Actually Works With Every Local LLM I Throw At It
- Unsloth Studio: Open-Source Web UI for Training and Running LLMs Locally
- On-Device AI: Tether's QVAC Fabric Enables Local Training
- I Switched to a Local LLM for These 5 Tasks and the Cloud Version Hasn't Been Worth It Since
- LucidShark – Local-first, open-source quality and security gate
- You're Using Your Local LLM Wrong If You're Prompting It Like a Cloud LLM
- Hugging Face Releases One-Liner for Automatic Hardware Detection and Model Selection
- Run LLMs Locally with Llama.cpp
- I Ran Local LLMs on a 'Dead' GPU, and the Results Surprised Me
- Mistral Releases Small 4 Open-Source Model Under Apache 2.0
- How I Used Lima for an AI Coding Agent Sandbox
- Kimi Introduces Attention Residuals: 1.25x Compute Performance at <2% Overhead
- Practical Fix for Qwen 3.5 Overthinking in llama.cpp
- This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
- Apple's On-Device AI Raises Privacy Alarms Across British Parliament
- AMD Declares 'AI on the PC Has Crossed an Important Line' – Agent Computers as Next Breakthrough
- OpenClaw vs Eigent vs Claude Cowork: Comparing Open-Source AI Collaboration Platforms
- Running Qwen3.5-27B Across Multiple GPUs Over LAN Achieves Practical Speed for Local Inference
- AMD Launches Agent System Optimized for Local AI Inference With Ryzen and Radeon
- Intel OpenVINO Backend Support Now Available in llama.cpp
- Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
- How to Run Local LLMs in 2026: The Complete Developer's Guide
- AgentArmor: Open-Source 8-Layer Security Framework for AI Agents
- 3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
- Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
- Nvidia Releases Nemotron 3 Super: 120B MoE Model for Local Deployment
- Llama.cpp Adds True Reasoning Budget Support
- Cutile.jl Brings Nvidia CUDA Tile-Based Programming to Julia
- SK Hynix Completes Qualification for LPDDR6 Memory Optimized for AI Inference
- Sarvam Open-Sources 30B and 105B Reasoning Models
- NVIDIA Jetson Brings Open Models to Life at the Edge
- LMF – LLM Markup Format
- Llama.cpp Celebrates Major Milestone: From Leak to Industry Standard
- Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
- Mnemos: Persistent Memory System for Local AI Agents
- FreeBSD 14.4 Released: Implications for Local LLM Deployment
- M5 Max and M5 Ultra Chipsets Demonstrate Significant Bandwidth Improvements for Local LLM Inference
- Community Survey: AI Content Automation Stacks in 2026
- Strix Halo (Ryzen AI Max+ 395) Achieves Strong Local Inference Performance with ROCm 7.2
- Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
- HP Refreshes Lineup with AI-Focused Workstations
- Llama.cpp Merges Automatic Parser Generator to Mainline
- Turning Your Linux Terminal into a Local AI Assistant
- llama.cpp Merges Agentic Loop and MCP Client Support
- Apple Unveils MacBook Pro with M5 Pro and M5 Max Featuring On-Device AI
- OpenWrt 25.12.0 – Stable Release
- AMD Launches Copilot+ Desktop Chips to Compete in On-Device AI Market
- ÆTHERYA Core – Deterministic Policy Engine for Governing LLM Actions
- Qwen 3.5 Small Models Released: 0.8B to 9B Parameters Optimized for On-Device Inference
- Qwen 3.5 0.8B Successfully Deployed on 7-Year-Old Samsung S10E Using llama.cpp
- Framework Choice Critical: llama.cpp and vLLM Outperform Ollama for Qwen 3.5 Testing
- Critical: Qwen 3.5 Requires BF16 KV Cache, Not FP16 for Accurate Inference
- GitDelivr: A Free CDN for Git Clones Built on Cloudflare Workers and R2
- C7: Pipe Up-to-Date Library Docs Into Any LLM From the Terminal
- Huawei's SuperPoD Portfolio Creates New Option for Global Computing at MWC Barcelona 2026
- Unsloth Dynamic 2.0 GGUFs
- 5 Useful Docker Containers for Agentic Developers
- Seco Launches Edge AI System-on-Module at Embedded World 2026
- Arduino and Qualcomm Bring On-Device AI Learning to Indian Schools
- DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
- Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
- Qwen3.5-27B Identified as Sweet Spot for Mid-Range Local Deployment
- Mirai Announces $10M to Advance On-Device AI Performance for Consumer Devices
- Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods
- How AI is Redefining Price and Performance in Modern Laptops
- Show HN: A Ground Up TLS 1.3 Client Written in C
- Enterprise Infrastructure Guide: Running Local LLMs for 70-150 Developers
- Apple Accelerates U.S. Manufacturing with Mac Mini Production
- nanollama: Open-Source Framework for Training Llama 3 from Scratch with One-Command GGUF Export
- Open-Source llama.cpp Finds Long-Term Home at Hugging Face
- Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
- Ouro 2.6B Thinking Model GGUFs Released with Q8_0 and Q4_K_M Quantization
- Strix Halo Performance Benchmarks: Minimax M2.5, Step 3.5 Flash, Qwen3 Coder
- I Thought I Needed a GPU to Run AI Until I Learned About These Models
- Open-Source + AI: ggml Joins Hugging Face, llama.cpp Stays Open—Local AI's Long-Term Home
- PaddleOCR-VL Now Integrated into llama.cpp for Multilingual OCR
- Kitten TTS V0.8 Released: New State-of-the-Art Super-Tiny TTS Model Under 25 MB
- Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
- Self-Hosted AI: A Complete Roadmap for Beginners
- Meet Sarvam Edge: India's AI Model That Runs on Phones and Laptops With No Internet
- Qwen 3.5-397B-A17B Now Available for Local Inference with Aggressive Quantisation
- Open-Source Models Now Comprise 4 of Top 5 Most-Used Endpoints on OpenRouter
- Switching From Ollama And LM Studio To llama.cpp: A Performance Comparison
- SnowBall Technique Addresses Context Window Limitations in Local LLMs
- Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
- MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities
- MiniMax-M2.5 230B MoE Model Released with GGUF Support for Local Deployment
- GNOME's AI Assistant Newelle Adds llama.cpp Support and Command Execution
- Context Management Identified as Real Bottleneck in AI-Assisted Coding
- Optimal llama.cpp Settings Found for Qwen3 Coder Next Loop Issues
- GitHub Announces Support for Open Source AI Project Maintainers
- New Header-Only C++ Benchmark Tool for Predictive Models on Raw Binary Streams