Tagged "llm-benchmarking"

DeepSWE v1.1 – Updated Execution and Grading for Software Engineering Tasks 24 June 2026
Benchmarking a Portable AI Workstation: Lenovo ThinkPad P16 Gen 3, Part 2 21 May 2026
LLM temporal and causal reasoning research 15 May 2026
Comprehensive Benchmark: 37 LLMs Tested on MacBook Air M5 With Open-Source Tool 7 April 2026
Gemma 4 31B Achieves Third Place on FoodTruck Bench, Beating Larger Models 5 April 2026
YC-Bench: GLM-5 Matches Claude Opus 4.6 at 11× Lower Cost 4 April 2026
Forensic Beats Mem0 with 90.1% on LOCOMO Benchmark 28 March 2026