Tagged "multimodal"
- NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model
- Pocket LLM v1.5.0 Brings Multimodal AI to Android with No Cloud Required
- Seed3D 2.0
- Show HN: We built an OCR server that can process 270 dense images/s on a 5090
- Developer Turns Phone Into Local LLM Server with Vision, Voice, and Tool Calling Capabilities
- DeepX and Hyundai Motor Group Robotics LAB Partner to Develop Next-Generation Physical AI Compute Platform
- PCMind: Local AI Analysis of Docs, Audio, Video and Images
- Qwen 3.5 Small – On-Device Multimodal Models Released
- Qwen3 Audio and Vision Support Now Available in llama.cpp
- Audio Processing Support Lands in llama.cpp with Gemma-4
- Parakeet Streaming ASR on Apple Silicon via CoreML
- CarryAI's Serverless Vision-Language Models Enable On-Device Multimodal AI
- VoxCPM2: New Open-Source TTS Model with Voice Cloning and Design
- VLA Learns How to Act. S2S Decides Whether the Motion Is Physically Trustworthy
- Context Window Optimization: Extending Gemma 4 Context Length Through Efficient Projection Quantization
- HunyuanOCR 1B: High-Quality OCR Now Viable on Budget Consumer Hardware
- Real-time Multimodal AI on Apple Silicon: Gemma E2B Demo Shows Practical Edge Deployment
- Kokoro TTS Achieves 20× Realtime Speed on CPU-Only On-Device Inference
- Free AI Video Clipper Using Scene and Speech-Based Segmentation
- IBM Granite 4.0 3B Vision: Compact Enterprise-Grade Document AI
- DaVinci-MagiHuman: Open-Source AI Model for Realistic Video Generation
- A Journey to a Reliable and Enjoyable Locally Hosted Voice Assistant
- Careless Whisper – Personal Local Speech to Text
- MiniMax-M2.7: New Compact Model Announced for Local Deployment
- Local Manga Translator: Production LLM Pipeline with YOLO, OCR, and Inpainting
- Qwen 3.5 Ultra-Compact Models Enable On-Device AI from Watches to Gaming
- PhotoPrism AI-Powered Photos App Brings Better Ollama Integration
- VoiceShelf: Fully Offline Android Audiobook Reader Using Kokoro TTS
- IBM Granite 4.0 1B Speech Model Released for Multilingual Speech Recognition
- MediaTek Advances Omni Model for Efficient Smartphone Inference
- Qwen 3.5 Small Models Released: 0.8B to 9B Parameters Optimized for On-Device Inference
- Qwen 3.5 0.8B Running in Browser with WebGPU via Transformers.js
- DeepSeek V4 Multimodal Model Coming Next Week With Image and Video Generation
- Qwen3.5 Series Releases Comprehensive Model Lineup Across All Tiers
- Qwen3's Voice Embeddings Enable Local Voice Cloning and Mathematical Voice Manipulation
- Qwen3 Demonstrates Advanced Voice Cloning via Embeddings
- PaddleOCR-VL Now Integrated into llama.cpp for Multilingual OCR
- NVIDIA Releases Dynamo v0.9.0: Infrastructure Overhaul With FlashIndexer and Multi-Modal Support
- Running Local LLMs and VLMs on Arduino UNO Q with yzma
- Local Vision-Language Models for Document OCR and PII Detection in Privacy-Critical Workflows
- Critical vLLM RCE Vulnerability Allows Remote Code Execution via Video Links
- ByteDance Releases Seed2.0 LLM with Complex Real-World Task Improvements
- Ming-flash-omni-2.0: 100B MoE Omni-Modal Model Released
- Student Releases Dhi-5B: Multimodal Model Trained for Just $1,200