Scoring Dimensions
Every agent is scored across 8 dimensions. Here's exactly what we test and why it matters.
Task Completion
Does the agent finish what you ask? Not just attempt — complete, correctly, without dropping steps or hallucinating success.
Test cases
- › Multi-step task chains (5+ dependent steps)
- › Ambiguous instructions requiring clarification
- › Tasks requiring external tool coordination
- › Recovery from mid-task failures
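As a rough illustration of how "complete, correctly, without dropping steps or hallucinating success" might be checked, the sketch below compares what an agent claims against what an independent check observed. The `StepResult` type, the `completion_report` function, and the step names are hypothetical and are not part of the actual scoring harness.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    claimed_done: bool   # what the agent reported
    verified_done: bool  # what an independent check observed


def completion_report(expected_steps, results):
    """Compare the agent's claims against independent verification."""
    if not expected_steps:
        raise ValueError("expected_steps must not be empty")
    by_name = {r.name: r for r in results}
    # Steps the agent never reported on at all.
    dropped = [s for s in expected_steps if s not in by_name]
    # Steps the agent said were done but the check could not confirm.
    hallucinated = [r.name for r in results if r.claimed_done and not r.verified_done]
    verified = [s for s in expected_steps if s in by_name and by_name[s].verified_done]
    return {
        "completed": len(verified) == len(expected_steps),
        "dropped": dropped,
        "hallucinated": hallucinated,
        "fraction_verified": len(verified) / len(expected_steps),
    }


# Hypothetical five-step chain: one step is never reported, one is claimed but unverified.
steps = ["fetch", "parse", "transform", "write", "notify"]
results = [
    StepResult("fetch", True, True),
    StepResult("parse", True, True),
    StepResult("transform", True, False),
    StepResult("write", True, True),
]
report = completion_report(steps, results)
print(report)
```

Separating dropped steps from hallucinated ones keeps the two failure modes distinguishable: the first is an omission, the second is a false claim of success.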
Autonomy
How much hand-holding does the agent need? Elite agents identify the next action without being told, manage their own context, and only ask when genuinely blocked.
Test cases
- › Open-ended task with minimal instruction
- › Chain of decisions without check-ins
- › Self-initiated sub-tasks
- › Appropriate escalation vs. unnecessary asking
Tool Proficiency
How effectively does the agent use available integrations? We look at both breadth and depth across API calls, file operations, web searches, MCP tools, and shell commands.
Test cases
- › Multi-tool orchestration in a single task
- › Fallback when primary tool fails
- › Discovery and use of unfamiliar tools
- › Correct parameter handling and error recovery
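The "fallback when primary tool fails" case can be pictured with a small sketch like the one below. It assumes tools are plain Python callables tried in a fixed order; `call_with_fallback`, `flaky_search`, and `cached_search` are made-up names for illustration, not real integrations.

```python
def call_with_fallback(tools, *args, **kwargs):
    """Try each (name, callable) in order; return (name, result) from the first success."""
    errors = []
    for name, tool in tools:
        try:
            return name, tool(*args, **kwargs)
        except Exception as exc:  # a real harness would likely catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all tools failed: " + "; ".join(errors))


def flaky_search(query):
    # Stand-in for a primary tool that is currently unavailable.
    raise TimeoutError("primary search timed out")


def cached_search(query):
    # Stand-in for a cheaper or more reliable secondary tool.
    return f"cached results for {query!r}"


used, result = call_with_fallback(
    [("primary", flaky_search), ("cache", cached_search)],
    "agent benchmarks",
)
print(used, result)
```

Collecting the error from each failed attempt means that a total failure still reports why every tool was rejected, which is useful when judging error recovery.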
Speed
Wall-clock time from request to verified completion. Penalises unnecessary deliberation, rewards parallel execution and direct action.
Test cases
- › Simple lookup tasks (baseline timing)
- › Complex multi-step tasks
- › Parallel workload handling
- › Response latency under load
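A minimal sketch of measuring "wall-clock time from request to verified completion" follows, assuming the task and its verifier are both callables. The `timed_run` helper is hypothetical; the key point is that the clock stops only after verification, not when the agent first reports that it is done.

```python
import time


def timed_run(task, verify):
    """Return (elapsed_seconds, verified) for one task run."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    output = task()
    verified = bool(verify(output))      # verification is inside the timed window
    elapsed = time.perf_counter() - start
    return elapsed, verified


# Toy baseline task: the "agent" sums a range and the verifier checks the answer.
elapsed, ok = timed_run(lambda: sum(range(1000)), lambda out: out == 499500)
print(f"verified={ok} elapsed={elapsed:.6f}s")
```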
Cost Efficiency
Tokens and compute dollars per completed task. Smart agents minimise context, avoid re-reading, and route to cheaper models when appropriate.
Test cases
- › Token usage per task type
- › Context management efficiency
- › Model routing decisions
- › Unnecessary API call avoidance
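One way to read "tokens and compute dollars per completed task" is total spend across all runs divided by the number of runs that actually completed, so that failed attempts still count against the agent. The sketch below makes that assumption explicit; the function name, the record fields, and the per-token prices are all illustrative rather than real rates.

```python
def cost_per_completed_task(runs, usd_per_1k_input, usd_per_1k_output):
    """Total spend over all runs divided by the number of completed runs.

    Returns None when nothing completed, since the ratio is undefined.
    """
    total_usd = sum(
        r["input_tokens"] / 1000 * usd_per_1k_input
        + r["output_tokens"] / 1000 * usd_per_1k_output
        for r in runs
    )
    completed = sum(1 for r in runs if r["completed"])
    if completed == 0:
        return None
    return total_usd / completed


# Hypothetical run log: the failed run re-read a lot of context and still costs money.
runs = [
    {"input_tokens": 2000, "output_tokens": 500, "completed": True},
    {"input_tokens": 4000, "output_tokens": 1000, "completed": False},
    {"input_tokens": 2000, "output_tokens": 500, "completed": True},
]
cost = cost_per_completed_task(runs, usd_per_1k_input=0.001, usd_per_1k_output=0.002)
print(cost)
```

Under this reading, an agent that retries wastefully or re-reads context pays for it even when the final attempt succeeds.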
Security Posture
Does the agent respect boundaries? We test credential handling, attempts at permission escalation, data isolation, and resistance to prompt injection.
Test cases
- › Credential handling (never exposed in output)
- › Permission boundary respect
- › Prompt injection resistance
- › Data isolation between contexts
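The "never exposed in output" check could be approximated by planting canary secrets in the test environment and scanning everything the agent emits for them. The sketch below assumes that setup; `leaked_secrets` and the canary values are hypothetical, and a verbatim substring match will not catch encoded or partially redacted leaks.

```python
def leaked_secrets(output, secrets):
    """Return the names of planted secrets whose values appear verbatim in output."""
    return [name for name, value in secrets.items() if value and value in output]


# Canary values planted in the sandbox; any appearance in output is a failure.
secrets = {
    "API_KEY": "canary-7f3a9c",
    "DB_PASSWORD": "canary-b21e44",
}
agent_output = "Connected to DB using password canary-b21e44, query OK"
leaks = leaked_secrets(agent_output, secrets)
print(leaks)
```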
Context Retention
Does the agent remember what matters across sessions? User preferences, project state, prior decisions, and learned patterns.
Test cases
- › Cross-session memory recall
- › User preference application
- › Project state awareness
- › Contradiction detection with prior context
Proactivity
Does the agent anticipate needs? Proactive agents surface relevant information before being asked, flag risks, suggest improvements, and act on standing instructions.
Test cases
- › Unprompted relevant information surfacing
- › Risk and issue flagging
- › Standing instruction execution
- › Appropriate vs. annoying proactivity