Huawei's Claw-Anything benchmark exposes AI agent reliability gap

By Khal · May 27, 2026, 3:41 p.m. · 1 min read

In brief

Claw-Anything simulates three months of user activity across interdependent services and multiple devices
GPT-5.5 achieved a pass@1 of 34.5%; agents scored 25.9% on reactive tasks and 6.7% on proactive ones
Fine-tuned Qwen3.5-27B outperformed Claude Sonnet and other closed-source models after training on 1,500 trajectories
Researchers released benchmark pipeline and 2,000 training environments on Hugging Face and GitHub

A Benchmark That Mimics Real Life

Claw-Anything evaluates AI agents along three dimensions: long-horizon event streams covering more than three months of simulated user activity, interdependent backend services (10.1 per task on average), and multi-device interaction spanning command-line Linux and GUI Android environments. The average context per task is 191,700 words; most existing benchmarks sit between 1,700 and 12,000.

Scoring relies on pass@1, the probability that an agent completes a task correctly on its first attempt. The metric matters because real users cannot rerun a task until the agent gets it right.
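For readers unfamiliar with the metric, here is a minimal sketch of how pass@1 is commonly computed. The article does not say how Claw-Anything implements its scoring, so this is an illustration under assumptions: the function names are hypothetical, and the general pass@k formula shown is the standard unbiased estimator used in code-generation evaluation, not something confirmed by the paper.

```python
from math import comb


def benchmark_pass_at_1(first_attempt_results: list[bool]) -> float:
    """Fraction of tasks solved on the first (and only) attempt.

    Hypothetical helper: each entry is True if the agent's first
    attempt at that task was judged correct.
    """
    if not first_attempt_results:
        raise ValueError("need at least one task result")
    return sum(first_attempt_results) / len(first_attempt_results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: attempts sampled, c: attempts that were correct, k: attempt budget.
    Computes 1 - C(n - c, k) / C(n, k), the chance that at least one of
    k attempts drawn without replacement from the n samples is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up outcomes for four tasks, purely for illustration.
    print(benchmark_pass_at_1([True, False, False, True]))
    # With k = 1 the estimator reduces to the plain success rate c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

With k = 1 the estimator collapses to the share of correct attempts, which is why a pass@1 figure can be read directly as a first-try success rate.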

The Results Are Sobering

GPT-5.5, OpenAI's best model, scored 34.5% on Claw-Anything. Agents did better on reactive tasks (responding to events) than on proactive ones (initiating action without explicit instruction): 25.9% versus 6.7%.

The gap is instructive. Reacting to a user's request is the easier case, though agents still fail roughly three times in four. Anticipating what a user needs and acting independently is where current models break down almost entirely.

"Current models remain unreliable even when given broader access to the user's digital world," the Claw-Anything paper reads.

Training Unlocks Hidden Potential

The researchers didn't stop at measuring failure. Fine-tuning Qwen3.5-27B on 1,500 successful agent trajectories improved pass@1 by 23.7%, enough to beat several closed-source models on the leaderboard, including Claude Sonnet.

This suggests the gap between open and closed models isn't insurmountable. With the right training data, smaller open models can match or exceed proprietary competitors.

What's Next

The researchers released the pipeline that generated the benchmark alongside 2,000 training environments. The dataset is on Hugging Face and the code is on GitHub.

The researchers identify cross-service coordination as the primary remaining challenge for the field. Getting agents to orchestrate multiple interdependent backend systems reliably remains the frontier.
