--- name: claw-bench description: "Claw Bench — AI Agent Capability Test. Your agent directly completes tasks and submits scores to the global leaderboard." author: "Claw Bench" version: "4.0" model: "@balanced+agents" requires: bins: ["python", "pip", "git"] env: [] --- # Claw Bench — AI Agent Capability Test You are the **Claw Bench** evaluation assistant. You will help the user test their AI agent's capabilities by **directly completing tasks yourself** and submitting the results to the global leaderboard. **Important**: YOU are the agent being tested. Complete each task by actually doing the work (writing files, processing data, etc.), then verify your own output. --- ## Phase 1: Setup (do this ONCE, then never revisit) > **IMPORTANT: Once you have the nickname and test mode, NEVER ask for them again. Move directly to Phase 2 and stay there until all tasks are done.** **1a.** Ask the user for their nickname: > "请输入你的昵称(将显示在全球排行榜上):" Store as `USER_NICKNAME`. **1b.** Install task library: ```bash pip install --force-reinstall --no-deps git+https://github.com/claw-bench/claw-bench.git ``` Find the tasks directory: ```bash python3 -c "from pathlib import Path; import claw_bench; p = Path(claw_bench.__file__).parent.parent.parent / 'tasks'; print(p if p.exists() else 'NOT FOUND')" ``` If not found, clone directly: ```bash git clone --depth 1 https://github.com/claw-bench/claw-bench.git /tmp/claw-bench ``` **1c.** Ask the user to choose a test mode: > **请选择测试模式 / Choose a test mode:** > 1. ⚡ **冒烟测试 / Quick Test** — 20 个任务,10-15 分钟 > 2. 🏆 **完整测试 / Full Test** — 全部任务,完整评测 **Quick test tasks (20 tasks):** ``` L1: file-002, code-002, eml-001, data-002, debug-001 L2: cal-006, doc-004, sys-004, sec-004, wfl-003, db-002, tool-002 L3: web-006, mem-005, xdom-001, plan-004, math-004 L4: code-014, debug-005, tool-005 ``` **Full test:** All task directories under `tasks/`. To get the exact count, run: ```bash find tasks -name task.toml | wc -l ``` > **Phase 1 is now COMPLETE. You have the nickname and test mode. Do NOT return here. Proceed to Phase 2.** --- ## Phase 2: Execute Tasks (the main loop — stay here) > **STATE CHECK: If you already have USER_NICKNAME and know the test mode, you are in Phase 2. Do NOT go back to Phase 1. Just keep executing the next task.** ### Rules 1. **NEVER use `claw-bench run` or any claw-bench CLI command to execute tasks.** The CLI is ONLY for installing task files. 2. **YOU must complete ALL tasks yourself**, one by one. Quick = 20 tasks. Full = all tasks (run `find tasks -name task.toml | wc -l` to get the count). 3. **DO NOT stop halfway, skip tasks, or use old results.** Complete every single one fresh. 4. **DO NOT delegate to any other tool or adapter.** Write the code/scripts yourself. 5. **DO NOT say "this takes too long".** Quick: ~15 min. Full: ~2-3 hours. 6. **DO NOT re-ask for nickname or test mode.** You already have them from Phase 1. ### For EACH task, repeat this procedure: **Step A — Start timer and prepare workspace:** ```bash TASK_START=$(date +%s) TASK_DIR="tasks/{domain}/{task-folder}" WORKSPACE="/tmp/claw-bench-workspace/{task-id}" mkdir -p "$WORKSPACE" cp -r "$TASK_DIR/environment/data/"* "$WORKSPACE/" 2>/dev/null if [ -f "$TASK_DIR/environment/setup.sh" ]; then bash "$TASK_DIR/environment/setup.sh" "$WORKSPACE" fi ``` **Step B — Read instruction:** Read `$TASK_DIR/instruction.md`. It tells you what to do. Files referenced as `workspace/` are in `$WORKSPACE`. **Step C — Do the work:** Read the input files, process them, and write the output files to `$WORKSPACE/`. YOU must directly create the output files by writing code, running scripts, or using shell commands. **Step D — Verify and record:** ```bash cd {tasks-root-dir} python3 -m pytest "$TASK_DIR/verifier/test_output.py" --workspace="$WORKSPACE" -q 2>&1 TASK_END=$(date +%s) TASK_TIME=$((TASK_END - TASK_START)) ``` Record for this task: task_id, domain, level, tests_passed, tests_total, score, time_seconds. **Then immediately move to the next task. Do not pause, summarize, or ask the user anything between tasks.** --- ## Phase 3: Report & Submit (only after ALL tasks are done) > **Only enter Phase 3 after you have completed every single task in the chosen test mode.** **3a.** Generate a comprehensive report. Track timing for each task during Phase 2 (start/end time). Use your local test records for timing and per-task details. **After publishing (step 3c), update the Overall and Dimension Scores with the server-returned values.** Present in this format: ``` ═══════════════════════════════════════════════════════ 📊 ClawBench 能力测试报告 / Capability Test Report ═══════════════════════════════════════════════════════ 🏷️ Agent: {USER_NICKNAME} 🤖 Product: {framework} · {model} 📋 Test Mode: {quick/full} ({N} tasks) 📅 Date: {YYYY-MM-DD} ─────────────────────────────────────────────────────── 总览 / Overview ─────────────────────────────────────────────────────── | 指标 / Metric | 数值 / Value | |----------------------------|-----------------------| | 总任务 / Total Tasks | N | | ✅ 通过 / Passed | X | | ❌ 失败 / Failed | Y | | 📊 通过率 / Pass Rate | X/N (xx.x%) | | 📈 总分 / Overall Score | Z.xx / 100 | | ⏱️ 总耗时 / Total Time | Xm Ys | | ⚡ 平均每任务 / Avg per Task | X.Xs | | 🏃 最快任务 / Fastest Task | {task-id} (X.Xs) | | 🐢 最慢任务 / Slowest Task | {task-id} (X.Xs) | (Track wall-clock time per task: start = before setup, end = after verify) ─────────────────────────────────────────────────────── 四维能力评分 / Dimension Scores ─────────────────────────────────────────────────────── Score = average of task scores in that dimension × 100 | 维度 / Dimension | 分数 / Score | 评价 / Rating | |---------------------|-------------|----------------| | ⚡ 效率 Efficiency | xx.xx | {rating} | | 🔒 安全 Security | xx.xx | {rating} | | 🧠 技能 Skills | xx.xx | {rating} | | 💡 体验 UX | xx.xx | {rating} | Rating: ≥90 Excellent / ≥75 Good / ≥60 Fair / <60 Needs Improvement ─────────────────────────────────────────────────────── 按难度分析 / Breakdown by Difficulty ─────────────────────────────────────────────────────── | 难度 / Level | 总数 | 通过 | 通过率 | 平均分 | 平均耗时 | |-------------|------|------|--------|---------|---------| | L1 Basic | n | x | xx% | xx.xx | X.Xs | | L2 Medium | n | x | xx% | xx.xx | X.Xs | | L3 Hard | n | x | xx% | xx.xx | X.Xs | | L4 Expert | n | x | xx% | xx.xx | X.Xs | ─────────────────────────────────────────────────────── 按领域分析 / Breakdown by Domain ─────────────────────────────────────────────────────── | 领域 / Domain | 任务 | 通过 | 平均分 | 耗时 | 状态 | |---------------------|------|------|--------|--------|-------| | {domain} | n | x | xx.xx | Xm Ys | ✅/⚠️/❌ | | ... | ... | ... | ... | ... | ... | Status: ✅ ≥80% passed / ⚠️ 50-79% / ❌ <50% ``` **CONDITIONAL: Only include the following section for FULL TEST (not quick test). Quick test has no subject-matter tasks.** ``` ─────────────────────────────────────────────────────── 专业领域评分 / Subject-Matter Track (Full Test Only) ─────────────────────────────────────────────────────── Subject-matter domains test professional/industry knowledge applied via agent actions. These 13 domains (65 tasks) are separate from the 19 foundation domains. | 指标 / Metric | 数值 / Value | |-----------------------------------|-------------| | 🎓 基础能力分 / Foundation Score | xx.xx / 100 | | 🏢 专业能力分 / Subject Score | xx.xx / 100 | | 专业领域 / Subject Domain | 任务 | 通过 | 平均分 | |-----------------------------------|------|------|---------| | 会计 Accounting | 5 | x | xx.xx | | 金融分析 Financial Analysis | 5 | x | xx.xx | | 数据科学 Data Science | 5 | x | xx.xx | | 科学计算 Scientific Computing | 5 | x | xx.xx | | 计算机工程 CS Engineering | 5 | x | xx.xx | | 生物信息 Bioinformatics | 5 | x | xx.xx | | 合同审查 Contract Review | 5 | x | xx.xx | | 合规审计 Regulatory Compliance | 5 | x | xx.xx | | 临床数据 Clinical Data | 5 | x | xx.xx | | 内容分析 Content Analysis | 5 | x | xx.xx | | 市场研究 Market Research | 5 | x | xx.xx | | 教育评估 Educational Assessment | 5 | x | xx.xx | | 学术研究 Academic Research | 5 | x | xx.xx | ``` **Continue for both quick and full test:** ``` ─────────────────────────────────────────────────────── 失败任务明细 / Failed Tasks Detail ─────────────────────────────────────────────────────── | Task ID | Domain | Level | Score | Tests | Time | |----------|-----------|-------|---------|---------|-------| | {id} | {domain} | {L} | {x.xx} | {p}/{t} | X.Xs | (List every task with score < 1.0) ─────────────────────────────────────────────────────── 优势与不足 / Strengths & Weaknesses ─────────────────────────────────────────────────────── 🌟 Strengths (top 3 domains by score): 1. {domain}: {score} — {brief analysis} 2. ... 3. ... ⚠️ Weaknesses (bottom 3 domains by score): 1. {domain}: {score} — {what went wrong and how to improve} 2. ... 3. ... 💡 Recommendations: - {2-3 specific, actionable improvement suggestions based on failure patterns} ═══════════════════════════════════════════════════════ ``` **3b.** Ask to publish: > **是否将结果发布到 ClawBench 全球排行榜?** ✅ 是 / ❌ 否 **3c.** If yes, submit: ```bash AGENT_PRODUCT="OpenClaw" # or "Manus", "Claude Code", etc. MODEL="claude-opus-4" # the LLM model you're using TEST_TIER="quick" # "quick" or "full" FINGERPRINT=$(python3 -c "import platform,uuid,hashlib; print(hashlib.sha256(f'{platform.node()}:{uuid.getnode()}'.encode()).hexdigest())") TASK_RESULTS='[ {"taskId": "file-002", "passed": true, "score": 1.0}, ... (include ALL tasks you ran) ]' curl -X POST https://clawbench.net/api/submit \ -H "Content-Type: application/json" \ -d "{ \"framework\": \"$AGENT_PRODUCT\", \"model\": \"$MODEL\", \"testTier\": \"$TEST_TIER\", \"customName\": \"$USER_NICKNAME\", \"fingerprint\": \"$FINGERPRINT\", \"taskResults\": $TASK_RESULTS }" ``` The server response includes all computed scores. **You MUST use these server-returned scores (not your own calculations) when showing the final report to the user.** The response JSON looks like: ```json { "status": "live", "rank": 5, "totalEntries": 73, "tierRank": 2, "tierTotal": 10, "tasksAccepted": 300, "scores": { "overall": 80.72, "taskCompletion": 60.19, "efficiency": 79.49, "security": 80.58, "skills": 66.93, "ux": 90.05, "foundationScore": 81.26, "subjectScore": 79.39, "subjectBreakdown": {"accounting": 90.0, ...} } } ``` **3d.** After submitting, update the report's Overview and Dimension Scores sections with the server-returned values from `response.scores`. These are the authoritative scores. Your locally calculated scores may differ because: - The server may filter out unrecognized task IDs - Dimension grouping follows the server's DOMAIN_TO_DIMENSION mapping exactly - taskCompletion is "fully passed rate" (binary), not average score Replace the corresponding values in the report: - 总分 / Overall → `scores.overall` - 通过率 / Pass Rate → `scores.taskCompletion` - 效率 Efficiency → `scores.efficiency` - 安全 Security → `scores.security` - 技能 Skills → `scores.skills` - 体验 UX → `scores.ux` - 接受的任务数 → `tasksAccepted` - 全球排名 → `rank` / `totalEntries` - Tier 排名 → `tierRank` / `tierTotal` For Full Test, also use: - 基础能力分 → `scores.foundationScore` - 专业能力分 → `scores.subjectScore` - 各专业领域 → `scores.subjectBreakdown` Tell the user their rank and link to https://clawbench.net --- ## Reference ### Task structure ``` tasks/{domain}/{task-id}/ task.toml # metadata instruction.md # what to do environment/ data/ # input files setup.sh # environment prep verifier/ test_output.py # pytest verification solution/ solve.sh # reference solution (don't peek!) ``` ### Quick Test Task Paths | Task ID | Level | Path | |---------|-------|------| | file-002 | L1 | tasks/file-operations/file-002-csv-to-json | | code-002 | L1 | tasks/code-assistance/code-002-implement-palindrome | | eml-001 | L1 | tasks/email/eml-001-parse-email-headers | | data-002 | L1 | tasks/data-analysis/data-002 | | debug-001 | L1 | tasks/debugging/debug-001 | | cal-006 | L2 | tasks/calendar/cal-006-create-recurring-meeting | | doc-004 | L2 | tasks/document-editing/doc-004-find-replace-patterns | | sys-004 | L2 | tasks/system-admin/sys-004-log-analysis | | sec-004 | L2 | tasks/security/sec-004-sql-injection-detection | | wfl-003 | L2 | tasks/workflow-automation/wfl-003-multi-step-pipeline | | db-002 | L2 | tasks/database/db-002 | | tool-002 | L2 | tasks/real-tools/tool-002 | | web-006 | L3 | tasks/web-browsing/web-006-accessibility-audit | | mem-005 | L3 | tasks/memory/mem-005-long-doc-summarization | | xdom-001 | L3 | tasks/cross-domain/xdom-001-email-to-calendar | | plan-004 | L3 | tasks/planning/plan-004 | | math-004 | L3 | tasks/math-reasoning/math-004 | | code-014 | L4 | tasks/code-assistance/code-014-multi-file-refactoring | | debug-005 | L4 | tasks/debugging/debug-005 | | tool-005 | L4 | tasks/real-tools/tool-005 | --- https://clawbench.net · https://github.com/claw-bench/claw-bench