Tagged "memory-optimization"

Tether AI Upgrades QVAC SDK With TurboQuant for Data Center-Sized Memory on Everyday Devices 2 June 2026
Meet Memory OS: A 6-Layer Open-Source Memory Stack Built on Hermes Agent 2 June 2026
Samsung's Exynos 2800 Brings HBM Memory to Mobile AI, Enabling Faster Local Model Inference 26 May 2026
Anker Soundcore Liberty 5 Pro Earbuds Feature Dedicated On-Device AI Chip with Touch Screen 26 May 2026
Redditor Successfully Runs 1 Trillion Parameter LLM Using Cheap Intel Optane DIMMs 24 May 2026
llama.cpp MTP Leak Fix Stabilizes Local AI Agents 22 May 2026
AMD's New Ryzen AI Max Pro 400 with 192GB LPDDR5X Memory 21 May 2026
Samsung's Exynos 2800 Brings Significant On-Device AI Capabilities 18 May 2026
DwarfStar 4: Native Inference Engine Optimized for DeepSeek V4 Flash 16 May 2026
Arm and Google Collaborate on On-Device AI Optimization Techniques 15 May 2026
Running Local AI LLMs on Mini PCs Without NVIDIA GPUs 14 May 2026
Local LLM Persistent Context Prevents Repetitive Mistakes 14 May 2026
Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training 14 May 2026
Claude Opus 4.7 System Prompt Leaks Raise Local Deployment Questions 14 May 2026
Running a Local LLM on a 12-Year-Old Raspberry Pi 13 May 2026
Microsoft Researchers Find AI Models and Agents Can't Handle Long-Running Tasks 12 May 2026
AMD's vLLM-ATOM Plugin Supercharges DeepSeek-R1 and Kimi-K2 Inference on MI350/MI400 12 May 2026
Critical Ollama Memory Leak Vulnerability Exposes 300,000 Servers Globally 9 May 2026
Critical Ollama Memory Leak Vulnerability Exposes 300,000 Servers Globally 8 May 2026
Show HN: A Local-First Agentic Knowledge Manager 8 May 2026
0ctx – Local-First Project Memory for AI Workflows 8 May 2026
Critical Ollama Memory Leak Vulnerability Exposes 300,000 Servers Globally 7 May 2026
Agentic AI Community Focus: Building Local Agents in 2026 6 May 2026
Show HN: Memex, Claude Memory via Local RAG with MCP and Offline Embeddings 5 May 2026
Gemma 4 Just Replaced My Whole Local LLM Stack 4 May 2026
Running a Serious AI Model on a Consumer GPU Just Got Easier and That Matters More Than the Benchmark 3 May 2026
Xmemory: Benchmarking Structured AI Memory Against RAG and Hybrid RAG 1 May 2026
GraphOS: Visual Runtime and Debugger for AI Agents with Local-First Execution 29 April 2026
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful 28 April 2026
Unsloth's Custom Kernels Make LLM Fine-Tuning Viable on Consumer GPUs 27 April 2026
Elastic KV Cache Memory Breakthrough Enables Efficient Bursty LLM Serving and GPU Sharing 26 April 2026
Can IBM's RITS Platform and vLLM Reset the Bar for Enterprise AI Access? 26 April 2026
Show HN: A Karpathy-Style LLM Wiki Your Agents Maintain 25 April 2026
Google's Gemma 4 Brings Powerful On-Device AI to Phones and Laptops 25 April 2026
I Replaced My Local LLM With a Model Half Its Size and Got Better Results 24 April 2026
Externalization in LLM Agents: Unified Review of Memory and Harness Engineering 23 April 2026
10GB VRAM Local LLM: The Complete Setup Guide (2026) 23 April 2026
Llama.cpp's Auto Fit Feature Quietly Reshapes Local AI Inference on Consumer Hardware 22 April 2026
Bun v1.3.13 20 April 2026
Unweight: Lossless MLP Weight Compression for LLM Inference 18 April 2026
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation 18 April 2026
SigMap – Shrink AI Coding Context 97% with Auto-Scaling Token Budget 15 April 2026
Dynamic Expert Cache in llama.cpp Achieves 27% Faster Inference on Large MoE Models 15 April 2026
GBrain – System to Make Your AI Agent Better Reflect You 15 April 2026
MiniMax M2.7 Achieves SOTA Performance Under 64GB on Mac with TQ Quantization 14 April 2026
Researchers Achieve 1-Bit Quantization of OLMo-3 7B Using Distillation 13 April 2026
Universal Knowledge Store and Grounding Layer for AI Reasoning Engines 12 April 2026
A Deep Dive into Tinygrad AI Compiler 12 April 2026
Self-Hosted LLMs Transform Personal Knowledge Management Systems 11 April 2026
DMax: New Parallel Decoding Paradigm for Diffusion Language Models 11 April 2026
Building Offline AI Companions on Severely Constrained Hardware (8GB RAM) 10 April 2026
LLM Wiki v2: Extended Knowledge Base for LLM Practitioners 10 April 2026
Intel Releases OpenVINO 2026.1 With Backend For Llama.cpp, New Hardware Support 9 April 2026
Octopoda: Open Source Memory Layer for Fully Offline AI Agents 7 April 2026
MemPalace, the Highest-Scoring AI Memory System Ever Benchmarked 7 April 2026
CricketBrain: Neuromorphic Signal Processor in Rust (0.175us/step, 944 bytes) 7 April 2026
TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache 6 April 2026
Context Window Optimization: Extending Gemma 4 Context Length Through Efficient Projection Quantization 6 April 2026
GPU Memory for LLM Inference (Part 1) 6 April 2026
Vektor – Local-First Associative Memory for AI Agents 5 April 2026
DGX Spark Hardware Limitations: Missing NVFP4 Support Undermines Local AI Value Proposition 5 April 2026
Gemma 4 26B MoE Emerges as Optimal All-Around Local Model for Consumer Hardware 5 April 2026
Mixed Precision Quantization on MLX with TurboQuant Implementation 4 April 2026
Gemma 4 KV Cache Memory Issues Fixed in llama.cpp 4 April 2026
OpenUMA – Apple-Style Unified Memory for x86 AI Inference 3 April 2026
VRAM Optimization Technique Cuts Gemma 4 Memory Usage by 3x 3 April 2026
SmolLM2-360M Running on Samsung Galaxy Watch 4 with 74% Memory Reduction 2 April 2026
Show HN: Memsearch – Persistent, Cross-Agent, Cross-Session Memory for AI Agents 2 April 2026
Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac 1 April 2026
Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains 1 April 2026
Claw64 – Full Agentic Loop in <4KB on Commodore 64 1 April 2026
PrismML Announces 1-Bit Bonsai: First Commercially Viable 1-Bit LLMs 1 April 2026
DeepSeek V3 Complete Guide: Deploy and Optimize Local AI in 2026 30 March 2026
Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference 29 March 2026
Mixed KV Cache Quantization: Performance Risks and Pitfalls 29 March 2026
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context 28 March 2026
Forensic Beats Mem0 with 90.1% on LOCOMO Benchmark 28 March 2026
TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice 27 March 2026
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching 26 March 2026
Running an Open-Weight LLM Locally on an Apple Watch 25 March 2026
Ultra-Large 400B-Class LLM Runs on iPhone in Test 25 March 2026
KV Cache Quantization Levels Benchmarked on SWE-bench: Practical Trade-offs for Local Inference 24 March 2026
FOMOE: Running 397B Parameter Qwen3.5 MoE at 5-9 tok/s on $2,100 Desktop Hardware 24 March 2026
Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives 22 March 2026
Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations 22 March 2026
Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting 22 March 2026
A Little Gap That Will Ensure the Future of AI Agents Being Autonomous 22 March 2026
Running an AI Agent on a 448KB RAM Microcontroller 21 March 2026
MacinAI Local brings functional LLM inference to classic Macintosh hardware 21 March 2026
DeepSeek R1 RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide 21 March 2026
Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models 20 March 2026
NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor 20 March 2026
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform 20 March 2026
Mamba 3: State Space Model Architecture Optimized for Inference 18 March 2026
Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware 18 March 2026
Mistral Small 4 119B Released with NVFP4 Quantisation Support 17 March 2026
Researcher Discovers Universal "Danger Zone" in Transformer Model Architecture at 50% Depth 17 March 2026
The Moment AI Agents Stopped Being a Feature and Started Becoming a System 17 March 2026
OpenClaw Isn't the Only Raspberry Pi AI Tool—Here Are 4 Others You Can Try This Week 16 March 2026
OmniCoder-9B: Efficient Coding Model for 8GB GPUs 16 March 2026
Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage 15 March 2026
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems 14 March 2026
Best Local LLM Models 2026: Developer Comparison 14 March 2026
3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens 14 March 2026
Qwodel – An Open-Source Unified Pipeline for LLM Quantization 12 March 2026
Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs 12 March 2026
Apple M5 Max 128GB Benchmark Results for Local LLM Inference 12 March 2026
Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results 11 March 2026
SK Hynix Develops 1c LPDDR6 DRAM to Boost On-Device AI Performance in Mobile Devices 10 March 2026
Mnemos: Persistent Memory System for Local AI Agents 10 March 2026
8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems 10 March 2026
HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026? 10 March 2026
FreeBSD 14.4 Released: Implications for Local LLM Deployment 10 March 2026
Qwen 3.5 Derestricted Model Available for Local Deployment 9 March 2026
How to Run Your Own Local LLM — 2026 Edition 9 March 2026
Engram – Open-Source Persistent Memory for AI Agents 9 March 2026
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide 8 March 2026
Mojo: Creating a Programming Language for an AI World with Chris Lattner 7 March 2026
Show HN: Asterode – Multi-Model AI App with Memory and Power Features 7 March 2026
The Emerging Role of SRAM-Centric Chips in AI Inference 6 March 2026
Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs 6 March 2026
How to Run High-Performance LLMs Locally on the Arduino UNO Q 1 March 2026
Nummi – AI Companion with Memory and Daily Guidance 1 March 2026
Unsloth Dynamic 2.0 GGUFs 28 February 2026
Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second 28 February 2026
LLmFit: Terminal Tool for Right-Sizing LLM Models to Your Hardware 28 February 2026
Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080 28 February 2026
Krasis Hybrid MoE Runtime Achieves 3,324 tok/s Prefill on Single RTX 5080 28 February 2026
Running LLMs on Raspberry Pi and Edge Devices: A Practical Guide 26 February 2026
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required 26 February 2026
Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals 25 February 2026
Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods 25 February 2026
What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup 25 February 2026
Which Web Frameworks Are Most Token-Efficient for AI Agents? 23 February 2026
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference 23 February 2026
Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation 23 February 2026
The Complete Stack for Local Autonomous Agents: From GGML to Orchestration 23 February 2026
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference 23 February 2026
O-TITANS: Orthogonal LoRA Framework for Gemma 3 with Google TITANS Memory Architecture 22 February 2026
Qwen3 Coder Next 8FP Demonstrates Exceptional Long-Context Performance on 128GB System 20 February 2026
Running Local LLMs and VLMs on Arduino UNO Q with yzma 19 February 2026
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs 19 February 2026
Local Vision-Language Models for Document OCR and PII Detection in Privacy-Critical Workflows 19 February 2026
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM 19 February 2026
InitRunner: YAML-Based AI Agent Framework with RAG and Memory 16 February 2026
Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release 16 February 2026
Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues 14 February 2026
NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x 14 February 2026
MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities 14 February 2026
GPT-OSS 120B Uncensored Model Released in Native MXFP4 Precision 14 February 2026
Context Management Identified as Real Bottleneck in AI-Assisted Coding 14 February 2026
Switching From Ollama and LM Studio to llama.cpp: Performance Benefits 13 February 2026
Ring-1T-2.5 Released with SOTA Deep Thinking Performance 13 February 2026
MiniMax M2.5: 230B Parameter MoE Model Coming to HuggingFace 13 February 2026
Ming-flash-omni-2.0: 100B MoE Omni-Modal Model Released 13 February 2026
Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues 13 February 2026
Running Your Own AI Assistant for €19/Month: Complete Self-Hosting Guide 12 February 2026
Heaps Do Lie: Debugging a Memory Leak in vLLM 12 February 2026
Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine 11 February 2026
Developer Switches from Ollama and LM Studio to llama.cpp for Better Performance 11 February 2026
Energy-Based Models Compared Against Frontier AI for Sudoku Solving 11 February 2026
Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data 11 February 2026