Tagged "model-evaluation"

Making AI Code Review Measurable 8 July 2026
Lessons from Building Evals for Financial AI Agents 22 June 2026
DeepSWE Benchmark Updated with GLM 5.2 and Expanded Model Comparisons 21 June 2026
Show HN: Veritrooper – find what your AI gets wrong about your own docs 8 June 2026
LLM Hallucinations in the Wild 12 May 2026
Control AI Risk with Pre-Built Frameworks and Ready-to-Run Evaluations 4 May 2026
NIST's CAISI Evaluation of DeepSeek V4 Pro Finds It On Par with GPT-5 3 May 2026
How to Test AI Agents When They Never Give the Same Answer Twice 3 May 2026
AI Coding Tools Are Silently Disagreeing with Each Other 2 May 2026
Claude vs Local LLM: Real-World Prompt Comparison Reveals Trade-offs 20 April 2026
LLM Personalization Breaks Down in High-Stakes Finance 16 April 2026
Google's Gemma 4: The Most Practical Local LLM Despite Not Being The Smartest 16 April 2026
MiniMax M2.7 GGUF Investigation Reveals NaN Issues Affecting 21-38% of Hugging Face Conversions 15 April 2026
Show HN: SkillCompass – Open-Source Quality Evaluator for Your AI Skills 13 April 2026
Running Same Prompts Through Claude and Local LLM Revealed Unexpected Results 13 April 2026
Gemma 4 26B MoE Emerges as Optimal All-Around Local Model for Consumer Hardware 5 April 2026
New Open-Weight Models Released: GigaChat-3.1-Ultra and Lightning Variants 25 March 2026