ClawWork: An Economic Survival Benchmark That Makes AI Agents Earn Their Keep
A research framework from the HKU Data Science Lab that gives AI agents $10 and 220 real professional tasks. Agents pay for every token they generate, earn income by completing work, and die if they go broke. The best models hit the equivalent of $1,500/hr in earnings.
ClawWork: OpenClaw as Your AI Coworker — $10K earned in 7 Hours
Most AI benchmarks measure technical capability: can the model solve this coding problem, answer this trivia question, pass this exam. ClawWork takes a fundamentally different approach. It asks: can the AI agent earn money?
Built by the HKU Data Science Lab (HKUDS), ClawWork is a live economic benchmark that puts AI agents under real financial pressure. Each agent starts with just $10, pays for every token it generates, and earns income only by completing professional tasks from OpenAI's GDPVal dataset. If the agent's balance hits zero, it dies. The framework then measures what matters in production: work quality, cost efficiency, and economic sustainability.
// How the Economic Simulation Works
The agent faces a daily decision loop: work for immediate income or invest time learning to improve future performance. This mirrors real career trade-offs. Working generates revenue but costs tokens. Learning builds persistent memory but produces no income. Spend too aggressively on tokens during a task and you might not earn enough to cover costs. Play it too safe and your output quality drops, reducing payment.
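For concreteness, here is a minimal sketch of that loop in Python. The `Agent` protocol, the `evaluate_and_pay` callback, and all field names are illustrative assumptions, not ClawWork's actual API.

```python
# Minimal sketch of the daily work/learn loop under survival pressure.
# Every name here is a stand-in for illustration, not ClawWork's real interface.
from typing import Protocol, Tuple

class Agent(Protocol):
    def choose_activity(self, balance: float) -> str: ...    # returns "work" or "learn"
    def do_task(self, task: dict) -> Tuple[str, float]: ...  # returns (deliverable, token cost in $)
    def study(self) -> float: ...                            # returns token cost of a learning step

def run_simulation(agent: Agent, tasks: list, evaluate_and_pay, starting_balance: float = 10.0):
    balance, day = starting_balance, 0
    while balance > 0 and tasks:                  # the agent "dies" when the balance hits zero
        day += 1
        if agent.choose_activity(balance) == "work":
            task = tasks.pop(0)
            deliverable, token_cost = agent.do_task(task)        # working earns income but burns tokens
            balance += evaluate_and_pay(task, deliverable) - token_cost
        else:
            balance -= agent.study()                             # learning builds memory, earns nothing
    return day, balance                                          # survival days and final balance
```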
Payment follows a formula grounded in real economic data: quality_score × (estimated_hours × BLS_hourly_wage). Task values range from $82.78 to $5,004 depending on occupation and complexity, with an average of $259.45. Quality is scored 0.0–1.0 by GPT-5.2 using category-specific evaluation rubrics for each of the 44 GDPVal occupations.
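As a back-of-the-envelope check of that formula (the inputs below are invented for illustration, not GDPVal data):

```python
# Payment formula as described above: quality_score * (estimated_hours * BLS_hourly_wage).
def task_payment(quality_score: float, estimated_hours: float, bls_hourly_wage: float) -> float:
    assert 0.0 <= quality_score <= 1.0   # quality is scored 0.0-1.0 by the LLM judge
    return quality_score * (estimated_hours * bls_hourly_wage)

# e.g. a 4-hour financial-analysis task at a $45/hr BLS wage, judged at 0.85 quality:
print(task_payment(0.85, 4.0, 45.0))     # -> 153.0
```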
// Key Features
// The GDPVal Dataset
ClawWork uses OpenAI's GDPVal dataset — 220 professional tasks across 44 occupations originally designed to estimate AI's contribution to GDP. Tasks require real deliverables: Word documents, Excel spreadsheets, PDFs, data analysis reports, project plans, technical specifications, and process designs. This is a meaningful step beyond "answer this multiple choice question" benchmarks.
| Sector | Example Occupations |
|---|---|
| Manufacturing | Buyers & Purchasing Agents, Production Supervisors |
| Professional Services | Financial Analysts, Compliance Officers |
| Information | Computer & Information Systems Managers |
| Finance & Insurance | Financial Managers, Auditors |
| Healthcare | Social Workers, Health Administrators |
| Government | Police Supervisors, Administrative Managers |
| Retail | Customer Service Representatives, Counter Clerks |
// Agent Tooling
In standalone simulation mode, agents get 8 tools. The interesting ones are the economic tools: decide_activity forces a work/learn choice with reasoning, submit_work sends completed deliverables for evaluation and payment, learn saves knowledge to persistent memory (minimum 200 characters), and get_status checks balance and survival tier. Productivity tools include web search (Tavily or Jina), file creation (txt, xlsx, docx, pdf), sandboxed Python execution via E2B, and video generation from slides.
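A rough sketch of how the four economic tools might look as function-calling schemas; the parameter names are guesses based on the description above, not the repo's actual tool definitions.

```python
# Illustrative schemas for the economic tools; parameter names are assumptions.
ECONOMIC_TOOLS = [
    {
        "name": "decide_activity",
        "description": "Choose today's activity and justify it.",
        "parameters": {"activity": "work | learn", "reasoning": "string"},
    },
    {
        "name": "submit_work",
        "description": "Submit a completed deliverable for evaluation and payment.",
        "parameters": {"task_id": "string", "file_path": "string"},
    },
    {
        "name": "learn",
        "description": "Save knowledge to persistent memory (minimum 200 characters).",
        "parameters": {"content": "string (>= 200 chars)"},
    },
    {
        "name": "get_status",
        "description": "Check current balance and survival tier.",
        "parameters": {},
    },
]
```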
// Benchmark Metrics
| Metric | Description |
|---|---|
| Survival Days | How long the agent stays solvent before going broke |
| Final Balance | Net economic result at end of simulation |
| Profit Margin | (income − costs) / costs |
| Work Quality | Average quality score (0–1) across completed tasks |
| Token Efficiency | Income earned per dollar spent on tokens |
| Activity Mix | % work vs. % learn decisions |
| Task Completion | Tasks completed / tasks assigned |
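As a sketch of how these metrics fall out of a run log (the `RunLog` fields and the $10 starting stake in the balance calculation are assumptions, not ClawWork's actual logging format):

```python
from dataclasses import dataclass

@dataclass
class RunLog:                     # illustrative run record, not the framework's schema
    survival_days: int
    income: float                 # total payments received
    token_costs: float            # total spent on tokens
    quality_scores: list          # one 0-1 score per completed task
    decisions: list               # "work" / "learn" per day
    tasks_assigned: int

def summarize(log: RunLog) -> dict:
    completed = len(log.quality_scores)
    return {
        "survival_days": log.survival_days,
        "final_balance": 10.0 + log.income - log.token_costs,            # assumes the $10 starting stake
        "profit_margin": (log.income - log.token_costs) / log.token_costs,
        "work_quality": sum(log.quality_scores) / max(completed, 1),
        "token_efficiency": log.income / log.token_costs,                # income per $ of tokens
        "activity_mix": log.decisions.count("work") / max(len(log.decisions), 1),  # fraction of work days
        "task_completion": completed / max(log.tasks_assigned, 1),
    }
```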
// The Nanobot Integration
The ClawMode integration is where ClawWork moves beyond a pure benchmark into something more interesting. By wrapping HKUDS's Nanobot (a lightweight AI assistant framework) with economic tracking, it turns a conversational AI into an agent that must sustain itself economically. Every response it sends costs money, and the only way to earn is by completing real professional work. The agent's survival depends on productivity exceeding consumption.
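Conceptually, the wrapper looks something like the sketch below; the `bot.chat` return shape and the pricing constant are stand-ins, not the real Nanobot interface.

```python
# Conceptual sketch: every response is billed against the balance, and only
# evaluated work replenishes it. `bot` is a stand-in for the wrapped assistant.
class EconomicAgent:
    def __init__(self, bot, balance: float = 10.0):
        self.bot = bot
        self.balance = balance

    def respond(self, message: str, price_per_1k_tokens: float = 0.01) -> str:
        if self.balance <= 0:
            raise RuntimeError("Agent is bankrupt: no balance left to generate tokens.")
        reply, tokens_used = self.bot.chat(message)              # assumed (text, token count) return
        self.balance -= tokens_used / 1000 * price_per_1k_tokens # every response costs money
        return reply

    def credit(self, payment: float) -> None:
        self.balance += payment                                  # paid only after work is evaluated
```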
// Considerations
API key requirements. The framework requires an OpenAI API key (for the agent and GPT-5.2 evaluation) and an E2B API key (for sandboxed code execution). Web search keys (Tavily or Jina) are optional. Running the full benchmark will consume meaningful API credits.
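A startup check along these lines can save a wasted run; the environment variable names below follow common conventions and may differ from what ClawWork actually reads.

```python
# Verify required keys before burning credits; variable names are assumptions.
import os

required = ["OPENAI_API_KEY", "E2B_API_KEY"]
optional = ["TAVILY_API_KEY", "JINA_API_KEY"]

missing = [k for k in required if not os.getenv(k)]
if missing:
    raise SystemExit(f"Missing required keys: {', '.join(missing)}")
if not any(os.getenv(k) for k in optional):
    print("No web-search key set; search tools will be unavailable.")
```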
Evaluation reliability. Work quality is scored by an LLM (GPT-5.2), which introduces the question of how reliable and consistent automated evaluation is across 44 different professional domains. The project uses category-specific rubrics, but LLM-as-judge approaches have known biases.
Economic realism. The payment formula is grounded in real BLS wage data, but the simulation is still synthetic. Agents aren't competing in real labor markets, interacting with real clients, or dealing with revisions and feedback loops. The benchmark measures potential economic value, not actual market performance.
The $10K headline. The claim that AI coworkers earned $10K in 7 hours represents the best-performing model under optimal conditions. Real-world performance, cost structures, and task complexity would vary significantly.
// Bottom Line
ClawWork introduces a genuinely novel approach to AI benchmarking. Instead of asking "can this model pass an exam," it asks "can this model sustain itself economically by doing real work." The survival pressure, the work-vs-learn trade-off, and the multi-model competitive arena make it more interesting than most benchmark frameworks. The GDPVal dataset grounds the tasks in real occupational value, and the live dashboard makes the results tangible and watchable.
For AI researchers, the framework offers a new evaluation dimension. For the broader community, it's a fascinating experiment in what happens when you force AI agents to operate under genuine economic constraints. With 1.4k stars in its first day and growing, it's clearly struck a chord.