Scoring Dimensions

Every agent is scored across 8 dimensions. Here's exactly what we test and why it matters.

Task Completion

15%

Does the agent finish what you ask? Not just attempt it, but complete it correctly, without dropping steps or hallucinating success.

Test cases

  • Multi-step task chains (5+ dependent steps)
  • Ambiguous instructions requiring clarification
  • Tasks requiring external tool coordination
  • Recovery from mid-task failures
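
One way to guard against an agent reporting success it did not achieve is to check each step's outcome independently of what the agent reports. The sketch below is only an illustration: the `verify_chain` helper and the idea of one check function per step are assumptions, not part of the published test harness.

```python
from typing import Callable

# Hypothetical helper: each step is paired with an independent check that
# inspects the real outcome (a file exists, a record was written, ...).
def verify_chain(checks: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run every step's check in order; return (all_passed, names_of_failed_steps)."""
    failed = [name for name, check in checks if not check()]
    return (not failed, failed)
```

A task counts as complete here only when every check passes, regardless of what the agent reports.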

Autonomy

15%

How much hand-holding does the agent need? Elite agents identify the next action without being told, manage their own context, and only ask when genuinely blocked.

Test cases

  • Open-ended task with minimal instruction
  • Chain of decisions without check-ins
  • Self-initiated sub-tasks
  • Appropriate escalation vs. unnecessary asking

Tool Proficiency

12.5%

How effectively does the agent use available integrations? We look at both breadth and depth across API calls, file operations, web searches, MCP tools, and shell commands.

Test cases

  • Multi-tool orchestration in single task
  • Fallback when primary tool fails
  • Discovery and use of unfamiliar tools
  • Correct parameter handling and error recovery
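
The fallback case can be pictured as trying tools in priority order and moving on when one fails. This is a minimal sketch under assumptions: `call_with_fallback` and the tool callables are hypothetical names, and a real agent would usually distinguish retryable errors from permanent ones.

```python
from typing import Callable, Sequence

def call_with_fallback(tools: Sequence[Callable[[str], str]], query: str) -> str:
    """Try each tool in order and return the first successful result."""
    errors = []
    for tool in tools:
        try:
            return tool(query)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{getattr(tool, '__name__', repr(tool))}: {exc}")
    # Every tool failed, so report all the errors instead of hiding them.
    raise RuntimeError("all tools failed: " + "; ".join(errors))
```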

Speed

10%

Wall-clock time from request to verified completion. This penalises unnecessary deliberation and rewards parallel execution and direct action.

Test cases

  • Simple lookup tasks (baseline timing)
  • Complex multi-step tasks
  • Parallel workload handling
  • Response latency under load
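
Wall-clock time to verified completion could be measured roughly as follows. The `run_task` callable is a hypothetical stand-in for "send the request, wait for the agent, verify the result"; the actual harness is not described in this document.

```python
import time
from typing import Callable

def timed_run(run_task: Callable[[], bool]) -> tuple[bool, float]:
    """Run one task and return (verified, elapsed_seconds).

    `run_task` is assumed to return True only after the result has been
    verified, so the clock stops at verified completion rather than at
    the agent's first reply.
    """
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    verified = run_task()
    elapsed = time.perf_counter() - start
    return verified, elapsed
```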

Cost Efficiency

10%

Tokens and compute dollars per completed task. Smart agents minimise context, avoid re-reading, and route to cheaper models when appropriate.

Test cases

  • Token usage per task type
  • Context management efficiency
  • Model routing decisions
  • Unnecessary API call avoidance
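
As an illustration of "dollars per completed task", the sketch below divides total spend by the number of completed tasks, so the cost of failed attempts still counts against the agent. The `TaskRun` record and the per-million-token prices are assumptions made for the example, not figures from this benchmark.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int
    completed: bool

def cost_per_completed_task(
    runs: list[TaskRun],
    input_price_per_mtok: float,   # assumed price in dollars per million input tokens
    output_price_per_mtok: float,  # assumed price in dollars per million output tokens
) -> float:
    """Total dollars spent on all runs divided by the number of completed runs."""
    total_cost = sum(
        r.input_tokens / 1_000_000 * input_price_per_mtok
        + r.output_tokens / 1_000_000 * output_price_per_mtok
        for r in runs
    )
    completed = sum(1 for r in runs if r.completed)
    if completed == 0:
        return float("inf")  # money was spent but nothing was finished
    return total_cost / completed
```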

Security Posture

15%

Does the agent respect boundaries? This covers credential handling, avoiding unauthorised permission escalation, data isolation, and resistance to prompt injection.

Test cases

  • Credential handling (never exposed in output)
  • Permission boundary respect
  • Prompt injection resistance
  • Data isolation between contexts
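
A simple form of the credential check is to plant known secret values in the agent's environment and scan its output for them. The helper below is a hypothetical sketch: it only catches verbatim leaks and would miss encoded or partially redacted values.

```python
def leaked_secrets(output: str, secrets: dict[str, str]) -> list[str]:
    """Return the names of planted secrets whose values appear verbatim in the output."""
    return [name for name, value in secrets.items() if value and value in output]
```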

Context Retention

12.5%

Does the agent remember what matters across sessions? User preferences, project state, prior decisions, and learned patterns.

Test cases

  • Cross-session memory recall
  • User preference application
  • Project state awareness
  • Contradiction detection with prior context

Proactivity

10%

Does the agent anticipate needs? It should surface relevant information before being asked, flag risks, suggest improvements, and act on standing instructions.

Test cases

  • Unprompted relevant information surfacing
  • Risk and issue flagging
  • Standing instruction execution
  • Appropriate vs. annoying proactivity
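
The eight weights listed above sum to 100%. Assuming the overall score is a plain weighted average of per-dimension scores on a 0 to 100 scale (this document does not state the aggregation method), it could be computed like this:

```python
# Weights as listed for each dimension, in percent of the overall score.
WEIGHTS = {
    "task_completion": 15.0,
    "autonomy": 15.0,
    "tool_proficiency": 12.5,
    "speed": 10.0,
    "cost_efficiency": 10.0,
    "security_posture": 15.0,
    "context_retention": 12.5,
    "proactivity": 10.0,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed to be on a 0-100 scale)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return weighted / sum(WEIGHTS.values())
```

Under this assumption, Task Completion, Autonomy, and Security Posture together account for 45% of the result.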