Tagged "inference"
- Why the Same LLM Gives Different Answers in Different Environments
- What Type of AI Usage? Deployment Patterns and Implementation Considerations
- Blueprint: AI Hardware Design
- Using a Local LLM as a Zero-Shot Classifier
- I Cancelled Codex Two Months Ago. Opus 4.7 Brought Me Back
- Llama.cpp's Auto Fit Feature Quietly Reshapes Local AI Inference on Consumer Hardware
- Google's Gemma 4 Finally Makes Local LLM Deployment Compelling for Practitioners
- go-AI: New Inference API Library for Go Released
- Running DeepSeek R1 Locally: Your Complete Setup Guide
- Gemma 4 Just Replaced My Whole Local LLM Stack
- oMLX Framework Implements DFlash Attention for Optimized Inference
- VoxCPM2: New Open-Source TTS Model with Voice Cloning and Design
- GitHub Copilot CLI Adds Support for BYOK and Local Model Deployment
- Quansloth Using Google's TurboQuant Breaks the VRAM Wall for Local LLMs
- TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
- Show HN: Lightweight LLM Tracing Tool with CLI
- GPU Memory for LLM Inference (Part 1)
- Gemma 4 Shows Strong Reasoning Performance with Thinking Tokens
- AMD Provides Day 0 Support for Gemma 4 on Ryzen AI Processors and GPUs
- Satcove – Query 5 AI Models Simultaneously and Get Structured Verdicts
- DeepSeek V3 Complete Guide: Deploy and Optimize Local AI in 2026
- DeepSeek-R1 Chain-of-Thought Debugging: A Developer's Guide
- Intel Launches Arc Pro B70/B65 with 32GB VRAM for Local AI Inference
- Powerful AI Search Engine Built on Single GeForce RTX 5090
- Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
- Repurpose Old GPUs as Dedicated AI Inference Accelerators
- Llamafile 0.10 Released with GPU Support and Rebuilt Core
- Unsloth Studio: Open-Source Web UI for Training and Running LLMs Locally
- Mamba 3: State Space Model Architecture Optimized for Inference
- You're Using Your Local LLM Wrong If You're Prompting It Like a Cloud LLM
- Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
- This External GPU Enclosure Tries to Break Cloud Dependence for Local AI Inference
- Show HN: Buxo.ai – Calendly alternative where LLM decides which slots to show
- I made Karpathy's Autoresearch work on CPU
- Sarvam Open-Sources 30B and 105B Reasoning Models
- Reverse engineering a DOS game with no source code using Codex 5.4
- Windows 11 Notepad to Feature On-Device AI Text Generation Without Subscription
- llama-swap Emerges as Superior Alternative to Ollama and LM-Studio
- HP ZBook Ultra 14 G1a Workstation Reclaims Local AI Workflows for Professionals
- AMD Expands Ryzen AI 400 Series Portfolio for Consumer and Enterprise AI PC Options
- Accuracy vs. Speed in Local LLMs: Finding Your Sweet Spot
- DeepSeek Releases DualPath: Addressing Storage Bandwidth Bottlenecks in Agentic Inference
- Qwen3.5 Thinking Mode Can Be Disabled for Production Inference Optimization
- Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
- Qwen3-Code-Next Proves Practical for Local Development: Real-World Coding Tasks on Mac Studio
- Custom Portable Workstation Optimized for Local AI Inference Builds
- Open-Source llama.cpp Finds Long-Term Home at Hugging Face
- GPT-OSS 20B Demonstrates Practical Agentic Capabilities Running Fully Locally
- AI-Powered Reverse-Engineering of Rosetta 2 for Linux
- Ouro 2.6B Thinking Model GGUFs Released with Q8_0 and Q4_K_M Quantization