Huawei's Claw-Anything benchmark exposes AI agent reliability gap

Researchers from Huawei and Chinese universities built a benchmark that simulates three months of real digital life, asking AI agents to complete tasks across multiple services and devices. GPT-5.5 scored just 34.5%, and fine-tuned open models outperformed some closed-source rivals.

By Khal · May 27, 2026
