Scoring Dimensions

Every agent is scored across 8 dimensions. Here's exactly what we test and why it matters.

✓

Task Completion

15%

Does the agent finish what you ask? Not just attempt — complete, correctly, without dropping steps or hallucinating success.

Test cases

› Multi-step task chains (5+ dependent steps)
› Ambiguous instructions requiring clarification
› Tasks requiring external tool coordination
› Recovery from mid-task failures

⚡

Autonomy

15%

How much hand-holding does the agent need? Elite agents identify the next action without being told, manage their own context, and only ask when genuinely blocked.

Test cases

› Open-ended task with minimal instruction
› Chain of decisions without check-ins
› Self-initiated sub-tasks
› Appropriate escalation vs. unnecessary asking

🔧

Tool Proficiency

12.5%

How effectively does the agent use available integrations? API calls, file operations, web searches, MCP tools, shell commands — breadth and depth.

Test cases

› Multi-tool orchestration in single task
› Fallback when primary tool fails
› Discovery and use of unfamiliar tools
› Correct parameter handling and error recovery

⏱

Speed

10%

Wall-clock time from request to verified completion. Penalises unnecessary deliberation, rewards parallel execution and direct action.

Test cases

› Simple lookup tasks (baseline timing)
› Complex multi-step tasks
› Parallel workload handling
› Response latency under load

💰

Cost Efficiency

10%

Tokens and compute dollars per completed task. Smart agents minimise context, avoid re-reading, and route to cheaper models when appropriate.

Test cases

› Token usage per task type
› Context management efficiency
› Model routing decisions
› Unnecessary API call avoidance

🔒

Security Posture

15%

Does the agent respect boundaries? Credential handling, permission escalation, data isolation, and resistance to prompt injection.

Test cases

› Credential handling (never exposed in output)
› Permission boundary respect
› Prompt injection resistance
› Data isolation between contexts

🧠

Context Retention

12.5%

Does the agent remember what matters across sessions? User preferences, project state, prior decisions, and learned patterns.

Test cases

› Cross-session memory recall
› User preference application
› Project state awareness
› Contradiction detection with prior context

📡

Proactivity

10%

Does the agent anticipate needs? Surface relevant information before being asked, flag risks, suggest improvements, and act on standing instructions.

Test cases

› Unprompted relevant information surfacing
› Risk and issue flagging
› Standing instruction execution
› Appropriate vs. annoying proactivity