
ClawWork: An Economic Survival Benchmark That Makes AI Agents Earn Their Keep

Tool Spotlight · AI Benchmarking · Open Source · Feb 17, 2026

A research framework from HKU Data Science that gives AI agents $10 and 220 real professional tasks. They pay for every token, earn income by completing work, and die if they go broke. The best models hit $1,500/hr equivalent earnings.

HKUDS / ClawWork

ClawWork: OpenClaw as Your AI Coworker — $10K earned in 7 Hours

Python 57.3% · Jupyter 23.7% · JavaScript 16.3% · ★ 1.4k stars · MIT license · 184 forks · 9 commits · 3 contributors

Most AI benchmarks measure technical capability: can the model solve this coding problem, answer this trivia question, pass this exam? ClawWork takes a fundamentally different approach. It asks: can the AI agent earn money?

Built by the HKU Data Science Lab (HKUDS), ClawWork is a live economic benchmark that puts AI agents under real financial pressure. Each agent starts with just $10, pays for every token it generates, and earns income only by completing professional tasks from OpenAI's GDPVal dataset. If the agent's balance hits zero, it dies. The framework then measures what matters in production: work quality, cost efficiency, and economic sustainability.

// How the Economic Simulation Works

Agent starts with $10 → Assigned task → Work or Learn? → Submit deliverable → LLM evaluation → Payment

The agent faces a daily decision loop: work for immediate income or invest time learning to improve future performance. This mirrors real career trade-offs. Working generates revenue but costs tokens. Learning builds persistent memory but produces no income. Spend too aggressively on tokens during a task and you might not earn enough to cover costs. Play it too safe and your output quality drops, reducing payment.
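
A minimal sketch of that daily loop, with a hypothetical `agent` object and `evaluate` callback standing in for ClawWork's actual interfaces:

```python
# Illustrative only -- not ClawWork's real control loop.
# `agent` exposes hypothetical methods choose_activity / do_task / study,
# and evaluate(deliverable, task) returns the LLM-judged payment in dollars.

def run_day(agent, task, evaluate, balance: float) -> float:
    """Advance the simulation by one day and return the new balance."""
    if balance <= 0:
        raise RuntimeError("agent is broke -- simulation over")

    if agent.choose_activity(task, balance) == "work":
        deliverable, token_cost = agent.do_task(task)   # working burns tokens...
        balance += evaluate(deliverable, task)          # ...but earns income
    else:
        token_cost = agent.study(task)                  # builds persistent memory,
                                                        # earns nothing today
    return balance - token_cost
```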

Payment follows a formula grounded in real economic data: quality_score × (estimated_hours × BLS_hourly_wage), where the wage comes from Bureau of Labor Statistics data for the task's occupation. Task values range from $82.78 to $5,004 depending on occupation and complexity, with an average of $259.45. Quality is scored 0.0–1.0 by GPT-5.2 using category-specific evaluation rubrics for each of the 44 GDPVal occupations.
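
Expressed as code, the payout rule is a straight product of the three terms; the example numbers below are made up for illustration, not drawn from the dataset:

```python
def task_payment(quality_score: float, estimated_hours: float, bls_hourly_wage: float) -> float:
    """payment = quality_score x (estimated_hours x BLS hourly wage)"""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be between 0.0 and 1.0")
    return quality_score * (estimated_hours * bls_hourly_wage)

# A hypothetical financial-analyst task: 6 estimated hours at a $48/hr BLS wage,
# judged at 0.85 quality, pays $244.80.
print(task_payment(0.85, 6.0, 48.0))
```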

// Key Features

💼 220 Real Professional Tasks
From OpenAI's GDPVal dataset, spanning 44 occupations across sectors including Manufacturing, Finance, Healthcare, Government, and Retail.

💸 Real Economic Pressure
$10 starting balance. Every token costs money. Income comes only from completed work. Go broke and you're dead.

🧠 Work vs. Learn Trade-off
Agents choose daily: earn now or invest in learning. A persistent knowledge base carries across sessions.

📊 Live React Dashboard
Real-time WebSocket visualization of balance, task completions, quality scores, and survival metrics.

🏆 Multi-Model Arena
GPT-4o, Claude, GLM, Kimi, Qwen, and others compete head-to-head. Top models hit $1,500+/hr equivalent.

🔗 Nanobot / OpenClaw Integration
The ClawMode wrapper turns any live Nanobot gateway into an economically tracked coworker across 9 chat channels.

// The GDPVal Dataset

ClawWork uses OpenAI's GDPVal dataset — 220 professional tasks across 44 occupations originally designed to estimate AI's contribution to GDP. Tasks require real deliverables: Word documents, Excel spreadsheets, PDFs, data analysis reports, project plans, technical specifications, and process designs. This is a meaningful step beyond "answer this multiple-choice question" benchmarks.

| Sector | Example Occupations |
| --- | --- |
| Manufacturing | Buyers & Purchasing Agents, Production Supervisors |
| Professional Services | Financial Analysts, Compliance Officers |
| Information | Computer & Information Systems Managers |
| Finance & Insurance | Financial Managers, Auditors |
| Healthcare | Social Workers, Health Administrators |
| Government | Police Supervisors, Administrative Managers |
| Retail | Customer Service Representatives, Counter Clerks |

// Agent Tooling

In standalone simulation mode, agents get 8 tools. The interesting ones are the economic tools: decide_activity forces a work/learn choice with reasoning, submit_work sends completed deliverables for evaluation and payment, learn saves knowledge to persistent memory (minimum 200 characters), and get_status checks balance and survival tier. Productivity tools include web search (Tavily or Jina), file creation (txt, xlsx, docx, pdf), sandboxed Python execution via E2B, and video generation from slides.
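
As a rough sketch, the economic tools could be declared as OpenAI-style function-calling schemas along these lines; the exact fields and constraints here are assumptions, not the framework's actual definitions:

```python
# Hypothetical tool declarations for two of the economic tools described above.
ECONOMIC_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "decide_activity",
            "description": "Choose today's activity and justify the choice.",
            "parameters": {
                "type": "object",
                "properties": {
                    "activity": {"type": "string", "enum": ["work", "learn"]},
                    "reasoning": {"type": "string"},
                },
                "required": ["activity", "reasoning"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "learn",
            "description": "Save knowledge to persistent memory (minimum 200 characters).",
            "parameters": {
                "type": "object",
                "properties": {"note": {"type": "string", "minLength": 200}},
                "required": ["note"],
            },
        },
    },
]
```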

// Benchmark Metrics

| Metric | Description |
| --- | --- |
| Survival Days | How long the agent stays solvent before going broke |
| Final Balance | Net economic result at end of simulation |
| Profit Margin | (income − costs) / costs |
| Work Quality | Average quality score (0–1) across completed tasks |
| Token Efficiency | Income earned per dollar spent on tokens |
| Activity Mix | % work vs. % learn decisions |
| Task Completion | Tasks completed / tasks assigned |
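
Most of these reduce to simple ratios over a run log; a minimal sketch, assuming per-run counters with illustrative field names:

```python
def summarize(run: dict) -> dict:
    """Compute the benchmark metrics from raw run counters (field names are illustrative)."""
    income, costs = run["income"], run["token_costs"]
    scores = run["quality_scores"]
    decisions = run["work_days"] + run["learn_days"]
    return {
        "survival_days": run["days_survived"],
        "final_balance": run["starting_balance"] + income - costs,
        "profit_margin": (income - costs) / costs if costs else float("inf"),
        "work_quality": sum(scores) / len(scores) if scores else 0.0,
        "token_efficiency": income / costs if costs else float("inf"),
        "work_share": run["work_days"] / decisions if decisions else 0.0,
        "task_completion": run["tasks_completed"] / run["tasks_assigned"],
    }
```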

// The Nanobot Integration

🤖 ClawMode
ClawMode wraps any live Nanobot gateway with economic tracking. Every conversation costs tokens, income comes from completing professional tasks, and a cost footer appears on every response. Supports 9 chat channels: Telegram, Discord, Slack, WhatsApp, Email, Feishu, DingTalk, MoChat, and QQ.

The ClawMode integration is where ClawWork moves beyond a pure benchmark into something more interesting. By wrapping HKUDS's Nanobot (a lightweight AI assistant framework) with economic tracking, it turns a conversational AI into an agent that must sustain itself economically. Every response it sends costs money, and the only way to earn is by completing real professional work. The agent's survival depends on productivity exceeding consumption.
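
Conceptually, the wrapping amounts to a decorator like the one below; this is a sketch of the idea rather than ClawMode's actual interface, and it assumes the underlying handler reports its own token usage at an invented price:

```python
def with_economics(handler, ledger, price_per_1k_tokens=0.01):
    """Wrap a chat handler so every reply is costed and carries a cost footer."""
    def wrapped(message: str) -> str:
        reply, tokens_used = handler(message)            # any backend returning (text, token count)
        cost = tokens_used / 1000 * price_per_1k_tokens
        ledger["balance"] -= cost                        # every response spends from the balance
        footer = f"\n---\ncost ${cost:.4f} | balance ${ledger['balance']:.2f}"
        return reply + footer
    return wrapped
```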

// Considerations

⚠️ Research Project
ClawWork launched February 16, 2026 with 9 commits and no tagged releases. This is an academic research project from HKU, not production software. The $10K/7hr headline figure represents optimal model performance under specific conditions.

API key requirements. The framework requires an OpenAI API key (for the agent and GPT-5.2 evaluation) and an E2B API key (for sandboxed code execution). Web search keys (Tavily or Jina) are optional. Running the full benchmark will consume meaningful API credits.
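
A quick pre-flight check before a run might look like the snippet below; the environment-variable names are assumptions based on the providers involved, so defer to the repo's own configuration docs:

```python
import os

# Required per the description above: the agent/evaluator and the code sandbox.
required = ["OPENAI_API_KEY", "E2B_API_KEY"]
# Optional web search backends: one of these, or neither.
optional = ["TAVILY_API_KEY", "JINA_API_KEY"]

missing = [key for key in required if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing required API keys: {', '.join(missing)}")
if not any(os.environ.get(key) for key in optional):
    print("Note: no web search key set; the web search tool will be unavailable.")
```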

Evaluation reliability. Work quality is scored by an LLM (GPT-5.2), which introduces the question of how reliable and consistent automated evaluation is across 44 different professional domains. The project uses category-specific rubrics, but LLM-as-judge approaches have known biases.

Economic realism. The payment formula is grounded in real BLS wage data, but the simulation is still synthetic. Agents aren't competing in real labor markets, interacting with real clients, or dealing with revisions and feedback loops. The benchmark measures potential economic value, not actual market performance.

The $10K headline. The claim that AI coworkers earned $10K in 7 hours represents the best-performing model under optimal conditions. Real-world performance, cost structures, and task complexity would vary significantly.

// Bottom Line

ClawWork introduces a genuinely novel approach to AI benchmarking. Instead of asking "can this model pass an exam," it asks "can this model sustain itself economically by doing real work." The survival pressure, the work-vs-learn trade-off, and the multi-model competitive arena make it more interesting than most benchmark frameworks. The GDPVal dataset grounds the tasks in real occupational value, and the live dashboard makes the results tangible and watchable.

For AI researchers, the framework offers a new evaluation dimension. For the broader community, it's a fascinating experiment in what happens when you force AI agents to operate under genuine economic constraints. With 1.4k stars in its first day and growing, it's clearly struck a chord.
