Tagged "memory-optimization"
-
GraphOS: Visual Runtime and Debugger for AI Agents with Local-First Execution
-
Local AI Isn't Just Ollama—Here's the Ecosystem That Actually Makes It Useful
-
Unsloth's Custom Kernels Make LLM Fine-Tuning Viable on Consumer GPUs
-
Elastic KV Cache Memory Breakthrough Enables Efficient Bursty LLM Serving and GPU Sharing
-
Can IBM's RITS Platform and vLLM Reset the Bar for Enterprise AI Access?
-
Show HN: A Karpathy-Style LLM Wiki Your Agents Maintain
-
Google's Gemma 4 Brings Powerful On-Device AI to Phones and Laptops
-
I Replaced My Local LLM With a Model Half Its Size and Got Better Results
-
Externalization in LLM Agents: Unified Review of Memory and Harness Engineering
-
10GB VRAM Local LLM: The Complete Setup Guide (2026)
-
Llama.cpp's Auto Fit Feature Quietly Reshapes Local AI Inference on Consumer Hardware
-
Bun v1.3.13
-
Unweight: Lossless MLP Weight Compression for LLM Inference
-
Sorting 1M u64 KV-Pairs in 20ms on i9-13980HX Using Branchless Rust Implementation
-
SigMap – Shrink AI Coding Context 97% with Auto-Scaling Token Budget
-
Dynamic Expert Cache in llama.cpp Achieves 27% Faster Inference on Large MoE Models
-
GBrain – System to Make Your AI Agent Better Reflect You
-
MiniMax M2.7 Achieves SOTA Performance Under 64GB on Mac with TQ Quantization
-
Researchers Achieve 1-Bit Quantization of OLMo-3 7B Using Distillation
-
Universal Knowledge Store and Grounding Layer for AI Reasoning Engines
-
A Deep Dive into Tinygrad AI Compiler
-
Self-Hosted LLMs Transform Personal Knowledge Management Systems
-
DMax: New Parallel Decoding Paradigm for Diffusion Language Models
-
Building Offline AI Companions on Severely Constrained Hardware (8GB RAM)
-
LLM Wiki v2: Extended Knowledge Base for LLM Practitioners
-
Intel Releases OpenVINO 2026.1 With Backend For Llama.cpp, New Hardware Support
-
Octopoda: Open Source Memory Layer for Fully Offline AI Agents
-
MemPalace, the Highest-Scoring AI Memory System Ever Benchmarked
-
CricketBrain: Neuromorphic Signal Processor in Rust (0.175us/step, 944 bytes)
-
TurboQuant in Llama.cpp Achieves 6X Smaller KV Cache
-
Context Window Optimization: Extending Gemma 4 Context Length Through Efficient Projection Quantization
-
GPU Memory for LLM Inference (Part 1)
-
Vektor – Local-First Associative Memory for AI Agents
-
DGX Spark Hardware Limitations: Missing NVFP4 Support Undermines Local AI Value Proposition
-
Gemma 4 26B MoE Emerges as Optimal All-Around Local Model for Consumer Hardware
-
Mixed Precision Quantization on MLX with TurboQuant Implementation
-
Gemma 4 KV Cache Memory Issues Fixed in llama.cpp
-
OpenUMA – Apple-Style Unified Memory for x86 AI Inference
-
VRAM Optimization Technique Cuts Gemma 4 Memory Usage by 3x
-
SmolLM2-360M Running on Samsung Galaxy Watch 4 with 74% Memory Reduction
-
Show HN: Memsearch – Persistent, Cross-Agent, Cross-Session Memory for AI Agents
-
Ollama Adopts Apple's MLX Framework for Faster Local AI on Mac
-
Llama.cpp Merging TurboQuant Lite (attn-rot) with Major Performance Gains
-
Claw64 – Full Agentic Loop in <4KB on Commodore 64
-
PrismML Announces 1-Bit Bonsai: First Commercially Viable 1-Bit LLMs
-
DeepSeek V3 Complete Guide: Deploy and Optimize Local AI in 2026
-
Google's TurboQuant Shows Memory Constraints Remain Critical for Local LLM Inference
-
Mixed KV Cache Quantization: Performance Risks and Pitfalls
-
TurboQuant KV Cache Compression Achieves 22.8% Faster Decoding at 32K Context
-
Forensic Beats Mem0 with 90.1% on LOCOMO Benchmark
-
TurboQuant Benchmarked in Llama.cpp: Google's Extreme Compression Research Tested in Practice
-
Google's TurboQuant: The Unsexy AI Breakthrough Worth Watching
-
Running an Open-Weight LLM Locally on an Apple Watch
-
Ultra-Large 400B-Class LLM Runs on iPhone in Test
-
KV Cache Quantization Levels Benchmarked on SWE-bench: Practical Trade-offs for Local Inference
-
FOMOE: Running 397B Parameter Qwen3.5 MoE at 5-9 tok/s on $2,100 Desktop Hardware
-
Ditching Paid AI Services: Building Self-Hosted LLM Solutions as ChatGPT, Claude, and Gemini Alternatives
-
Qwen 3.5 122B Uncensored (Aggressive) Released with New K_P Quantisations
-
Llama 8B Matches 70B Performance on Multi-Hop QA Using Structured Prompting
-
A Little Gap That Will Ensure the Future of AI Agents Being Autonomous
-
Running an AI Agent on a 448KB RAM Microcontroller
-
MacinAI Local Brings Functional LLM Inference to Classic Macintosh Hardware
-
DeepSeek R1 RTX 4090 vs Apple M3 Max: Benchmark & Performance Guide
-
Community Converges on Optimal KV Cache Quantization Strategies for Qwen 3.5 Models
-
NVIDIA Nemotron Cascade 2 30B Delivers 120B-Class Performance in Compact Form Factor
-
LMCache Dramatically Accelerates LLM Inference on Oracle Data Science Platform
-
Mamba 3: State Space Model Architecture Optimized for Inference
-
Custom GPU Multiplexer Achieves 0.3ms Model Switching on Legacy Hardware
-
Mistral Small 4 119B Released with NVFP4 Quantisation Support
-
Researcher Discovers Universal "Danger Zone" in Transformer Model Architecture at 50% Depth
-
The Moment AI Agents Stopped Being a Feature and Started Becoming a System
-
OpenClaw Isn't the Only Raspberry Pi AI Tool—Here Are 4 Others You Can Try This Week
-
OmniCoder-9B: Efficient Coding Model for 8GB GPUs
-
Open-Source GreenBoost Driver Augments NVIDIA GPU VRAM With System RAM and NVMe Storage
-
Memory Should Decay: Implementing Temporal Memory Decay in Local LLM Systems
-
Best Local LLM Models 2026: Developer Comparison
-
3-Path Agent Memory: 8 KB Recurrent State vs. 156 MB KV Cache at 10K Tokens
-
Qwodel – An Open-Source Unified Pipeline for LLM Quantization
-
Quantization Explained: Q4_K_M vs AWQ vs FP16 for Local LLMs
-
Apple M5 Max 128GB Benchmark Results for Local LLM Inference
-
Experiment: 0.8B Model Self-Improvement on MacBook Air Yields Surprising Results
-
SK Hynix Develops 1c LPDDR6 DRAM to Boost On-Device AI Performance in Mobile Devices
-
Mnemos: Persistent Memory System for Local AI Agents
-
8 Local LLM Settings Most People Never Touch That Fixed My Worst AI Problems
-
HP OMEN MAX 16 Review: Is Local AI on a Laptop Viable in 2026?
-
FreeBSD 14.4 Released: Implications for Local LLM Deployment
-
Qwen 3.5 Derestricted Model Available for Local Deployment
-
How to Run Your Own Local LLM — 2026 Edition
-
Engram – Open-Source Persistent Memory for AI Agents
-
Llama.cpp Prompt Processing Optimization: Ubatch Size Configuration Guide
-
Mojo: Creating a Programming Language for an AI World with Chris Lattner
-
Show HN: Asterode – Multi-Model AI App with Memory and Power Features
-
The Emerging Role of SRAM-Centric Chips in AI Inference
-
Final Qwen3.5 Unsloth GGUF Update with Improved Size/Quality Tradeoffs
-
How to Run High-Performance LLMs Locally on the Arduino UNO Q
-
Nummi – AI Companion with Memory and Daily Guidance
-
Unsloth Dynamic 2.0 GGUFs
-
Qwen3.5-35B Successfully Runs on Raspberry Pi 5 at 3+ Tokens/Second
-
LLmFit: Terminal Tool for Right-Sizing LLM Models to Your Hardware
-
Krasis: Hybrid CPU/GPU MoE Runtime Achieves 3,324 Tokens/Second Prefill on RTX 5080
-
Krasis Hybrid MoE Runtime Achieves 3,324 tok/s Prefill on Single RTX 5080
-
Running LLMs on Raspberry Pi and Edge Devices: A Practical Guide
-
Researchers Develop Persistent Memory System for Local LLMs—No RAG Required
-
Show HN: Pluckr – LLM-Powered HTML Scraper That Caches Selectors and Auto-Heals
-
Advanced Quantization Techniques Show Surprising Performance Gains Over Standard Methods
-
What Breaks When AI Agent Frameworks Are Forced Into <1MB RAM and Sub-ms Startup
-
Which Web Frameworks Are Most Token-Efficient for AI Agents?
-
Breaking the Speed Limit: Strategies for 17k Tokens/Sec Local Inference
-
Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
-
The Complete Stack for Local Autonomous Agents: From GGML to Orchestration
-
O-TITANS: Orthogonal LoRA Framework for Gemma 3 with Google TITANS Memory Architecture
-
Qwen3 Coder Next 8FP Demonstrates Exceptional Long-Context Performance on 128GB System
-
Running Local LLMs and VLMs on Arduino UNO Q with yzma
-
Enhanced Quantization Visualization Methods for Understanding LLM Compression Trade-offs
-
Local Vision-Language Models for Document OCR and PII Detection in Privacy-Critical Workflows
-
LayerScale Launches Inference Engine Faster Than vLLM, SGLang, and TRT-LLM
-
InitRunner: YAML-Based AI Agent Framework with RAG and Memory
-
Alibaba Unveils Major AI Model Upgrade Ahead of DeepSeek Release
-
Scaling llama.cpp On Neoverse N2: Solving Cross-NUMA Performance Issues
-
NVIDIA's Dynamic Memory Sparsification Cuts LLM Inference Costs by 8x
-
MiniMax Releases M2.5 Model with SOTA Coding and Agent Capabilities
-
GPT-OSS 120B Uncensored Model Released in Native MXFP4 Precision
-
Context Management Identified as Real Bottleneck in AI-Assisted Coding
-
Switching From Ollama and LM Studio to llama.cpp: Performance Benefits
-
Ring-1T-2.5 Released with SOTA Deep Thinking Performance
-
MiniMax M2.5: 230B Parameter MoE Model Coming to HuggingFace
-
Ming-flash-omni-2.0: 100B MoE Omni-Modal Model Released
-
Running Your Own AI Assistant for €19/Month: Complete Self-Hosting Guide
-
Heaps Do Lie: Debugging a Memory Leak in vLLM
-
Mistral AI Debugs Critical Memory Leak in vLLM Inference Engine
-
Developer Switches from Ollama and LM Studio to llama.cpp for Better Performance
-
Energy-Based Models Compared Against Frontier AI for Sudoku Solving
-
Carmack Proposes Using Long Fiber Lines as L2 Cache for Streaming AI Data