Tagged "model-quantization"
- Show HN: Phonetic Formatter – Offline English Text to IPA on iPhone and iPad
- Run a Local LLM Server on Raspberry Pi with Remote Access Capabilities
- Google's Gemma 4 Brings Powerful On-Device AI to Phones and Laptops
- Netherlands Reaches Deal to Cut Reliance on U.S. Cloud Tech
- I Replaced My Local LLM With a Model Half Its Size and Got Better Results
- Llama 4 Scout on MLX: The Complete Apple Silicon Guide (2026)
- Externalization in LLM Agents: Unified Review of Memory and Harness Engineering
- 10GB VRAM Local LLM: The Complete Setup Guide (2026)
- The Open-Source AI Ecosystem Keeps Treating llama.cpp Like a Second-Class Citizen
- Minisforum Launches N5 Max AI NAS with OpenClaw
- Laimark – 8B LLM That Self-Improves on Consumer GPUs
- 115 TOPS in 0.67L: CHUWI AuBox X Packs On-Device AI Power Into a Palm-Sized Mini PC
- Building a Voice AI Wearable in a Casio F91W with Whisper and BLE
- Bonsai 1.7B in the Browser: A 290MB 1-bit LLM on WebGPU
- MiniMax M2.7 GGUF Investigation Reveals NaN Issues Affecting 21-38% of Hugging Face Conversions
- Running Gemma 4 on an iPhone 13 Pro
- Sovereign AI: Why the Next GPT Will Be Born in Our Living Rooms
- MiniMax M2.7 Achieves SOTA Performance Under 64GB on Mac with TQ Quantization
- Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
- Qwen3 Audio and Vision Support Now Available in llama.cpp
- MiniMax-M2.7 Delivers Exceptional Performance on Consumer Hardware
- Unsloth Completes Comprehensive MiniMax M2.7 GGUF Quantization Suite
- Universal Knowledge Store and Grounding Layer for AI Reasoning Engines
- MiniMax M2.7 Released: New Model Available for Local Deployment
- The Best Local AI Model for Home Assistant Isn't Always the Biggest One
- Intel Arc Pro B70 32GB Achieves 12 Tokens/Sec on Qwen 3.5-27B
- Gemma 4 31B vs Qwen 3.5 27B: Comprehensive Long Context Benchmark
- LLM Wiki v2: Extended Knowledge Base for LLM Practitioners
- Running a 1.7B-Parameter LLM on an Apple Watch
- Gemma 4 Support Stabilized in Llama.cpp
- Gemma 4 GGUF Models Updated with Critical Quantization Fixes
- EXAONE 4.5 33B Model Released with Multiple Quantization Formats
- Comprehensive Benchmark: 37 LLMs Tested on MacBook Air M5 With Open-Source Tool
- TurboQuant-Optimized llama.cpp Fork Delivers GFX906 GPU Acceleration
- Quantization Strategy Comparison: Balancing Quality and Speed on Consumer Laptops
- Context Window Optimization: Extending Gemma 4 Context Length Through Efficient Projection Quantization
- Google AI Edge Gallery Tops App Store Charts with On-Device Gemma 4
- Gemma 4 31B Achieves Exceptional Performance on Local Hardware
- Qwen 3.6 Free Model Available via OpenRouter
- Qualcomm Snapdragon Innovations Enable Advanced On-Device AI for Wearables
- DGX Spark Hardware Limitations: Missing NVFP4 Support Undermines Local AI Value Proposition
- GMKtec NucBox K17 Launches with 97 TOPS AI Performance for Local Inference
- Gemma 4 26B MoE Emerges as Optimal All-Around Local Model for Consumer Hardware
- Nex Life Logger: Local Activity Tracker with AI Agent Integration
- Google Gemma 4 Released with GGUF Quantizations
- Gemma 4 26B A4B Outperforms Qwen 3.5 35B on Apple Silicon
- Gemma 4 2B Successfully Runs on Raspberry Pi 5
- Gemma 4 on Arm: Optimized On-Device AI for Mobile and Edge Deployment
- Qwen 3.6-Plus Released
- Bonsai 1-Bit Models Deliver Exceptional Local Inference Performance
- Satcove – Query 5 AI Models Simultaneously and Get Structured Verdicts
- Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
- PrismML Announces 1-Bit Bonsai: First Commercially Viable 1-Bit LLMs
- Ollama Launches Pi: The Minimal Coding Agent That Powers OpenClaw Is Now Yours to Customize
- Select the Right Hardware for Your Local LLM Deployment with This Online Guide
- TurboQuant: Understanding the Quantization Breakthrough
- Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference
- ESP32-S31: 320MHz 2-Core Microcontroller with 512KB SRAM and Networking
- TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
- Qwen3 512k Context via TurboQuant on Mac mini
- TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
- RotorQuant: 10-19x Faster Quantization Alternative Using Clifford Algebra
- Coding Implementation to Run Qwen3.5 Reasoning Models Distilled With Claude-Style Thinking Using GGUF and 4-Bit Quantization
- Quantization Reveals Outliers Impacting LLM Accuracy
- Apple Gets Full Gemini Access and Uses Distillation to Build Lightweight On-Device AI
- Intel Launches Arc Pro B70/B65 with 32GB VRAM for Local AI Inference
- Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
- Apple Plans Slimmed-Down Gemini Models for Local iPhone AI Features
- Google TurboQuant: Extreme Compression for Local LLM Deployment
- Running an Open-Weight LLM Locally on an Apple Watch
- OmniCoder v2 Released: Improved Code Generation for Local Deployment
- Researcher Successfully Runs Local LLMs on Legacy "Dead" GPU With Surprising Results