Huawei's Claw-Anything benchmark exposes AI agent reliability gap
In brief
- Claw-Anything simulates three months of user activity across interdependent services and multiple devices
- GPT-5.5 achieved a pass@1 of 34.5%; agents scored 25.9% on reactive tasks and 6.7% on proactive ones
- Fine-tuned Qwen3.5-27B outperformed Claude Sonnet and other closed-source models after training on 1,500 trajectories
- Researchers released benchmark pipeline and 2,000 training environments on Hugging Face and GitHub
A Benchmark That Mimics Real Life
Claw-Anything evaluates AI agents along three dimensions: long-horizon event streams covering more than three months of simulated user activity, interdependent backend services (10.1 per task on average), and multi-device interaction across CLI Linux and GUI Android environments. The average context per task is 191,700 words, while most existing benchmarks fall between 1,700 and 12,000.
Scoring relies on pass@1, the probability that an agent completes a task correctly on its first try, with no do-overs. The metric matters because real users don't get infinite retries.
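For readers who want to see the metric concretely, here is a minimal Python sketch of pass@k. It assumes the unbiased estimator commonly used in code-generation evaluations; the article does not say how Claw-Anything computes its scores, so the function names and the per-task `(n, c)` input format are illustrative, not the benchmark's own. With k = 1 the estimate reduces to the fraction of attempts that passed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the procedure used by the benchmark authors.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over tasks; each entry is (attempts, passes) for one task."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

Under this sketch, a task attempted 10 times with 3 passes contributes a pass@1 of about 0.3, and a benchmark score is the mean of those per-task values.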
The Results Are Sobering
GPT-5.5, OpenAI's best model, scored 34.5% on Claw-Anything. Agents performed better on reactive tasks (responding to events) than on proactive ones (initiating action without explicit instruction): 25.9% versus 6.7%.
The gap is instructive. Reacting to a user's request is the easier case, though far from solved. Anticipating what a user needs and acting independently is where current models break down.
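A reactive-versus-proactive split like this one comes from grouping first-attempt outcomes by task type. The sketch below shows one way to do that in Python; the `(category, passed)` record format and the sample data are hypothetical and are not drawn from the benchmark's released files.

```python
from collections import defaultdict
from typing import Iterable


def pass_at_1_by_category(outcomes: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the share of first attempts that passed, per task category.

    Each outcome is a hypothetical (category, passed) pair describing a
    single first attempt at one task.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in outcomes:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Made-up outcomes, for illustration only.
sample = [
    ("reactive", True),
    ("reactive", False),
    ("proactive", False),
    ("proactive", False),
]
rates = pass_at_1_by_category(sample)
```

Reporting per-category rates alongside the overall score is what makes a gap like this visible; a single aggregate number would hide it.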
"Current models remain unreliable even when given broader access to the user's digital world," the Claw-Anything paper reads.
Training Unlocks Hidden Potential
The researchers didn't stop at measuring failure. Fine-tuning Qwen3.5-27B on 1,500 successful agent trajectories improved pass@1 by 23.7%, enough to beat several closed-source models on the leaderboard, including Claude Sonnet.
This suggests the gap between open and closed models isn't insurmountable. With the right training data, smaller open models can match or exceed proprietary competitors.
What's Next
The researchers released the pipeline that generated the benchmark alongside 2,000 training environments. The dataset is on Hugging Face and the code is on GitHub.
The researchers identify cross-service coordination as the primary remaining challenge. Getting multiple backend systems to play nice, and getting AI to orchestrate them reliably, remains the frontier.


