Topic: #benchmarking
Huawei's Claw-Anything benchmark exposes AI agent reliability gap
Researchers from Huawei and Chinese universities built a benchmark that simulates three months of real digital life, asking AI agents to complete tasks across multiple services and devices. GPT-5.5 scored just 34.5%, and fine-tuned open models outperformed some closed-source rivals.