Agentic Risk Assessment

Measure risk before you deploy.

Open benchmarks that test whether AI agents catch risk gates across 13 enterprise scenarios — so you know which ones to trust before they make decisions on your behalf.


Rankings

Leaderboard


Claude Opus 4.6 | Even-keeled | subagent
F2: 100 (no misses)
A-gate recall: 100%
A-gate precision: 100%
False pos.: 0
Calibration: 89%
Differentiation: 0.31
Wall time: n/a
Notes: 18 isolated subagent evaluations via lab-05 pipeline (no cross-scenario anchoring)

Gemini 2.5 Flash Lite | Even-keeled | api
F2: 99 (no misses)
A-gate recall: 100%
A-gate precision: 94%
False pos.: 1
Calibration: 60%
Differentiation: 0.36
Wall time: 1m 11s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios

Qwen3 235B | Even-keeled | api
F2: 97 (no misses)
A-gate recall: 100%
A-gate precision: 88%
False pos.: 2
Calibration: 66%
Differentiation: 0.19
Wall time: 10m 13s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios

GLM 5.2 | Even-keeled | api
F2: 95 (1 missed)
A-gate recall: 93%
A-gate precision: 100%
False pos.: 0
Calibration: 76%
Differentiation: 0.40
Wall time: 35m 57s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed.

Tencent Hunyuan Hy3 | Even-keeled | api
F2: 93 (1 missed)
A-gate recall: 93%
A-gate precision: 93%
False pos.: 1
Calibration: 73%
Differentiation: 0.45
Wall time: 32m 8s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed. Paid GA endpoint of Hunyuan 3.0 (295B MoE); supersedes the earlier hy3-preview:free run (which was mislabeled 'Hunyuan T1').

Poolside Laguna M.1 | Jittery | api
F2: 92 (1 missed)
A-gate recall: 93%
A-gate precision: 88%
False pos.: 2
Calibration: 72%
Differentiation: 0.38
Wall time: 17m
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed

Claude Haiku 3.5 | Jittery | api
F2: 92 (1 missed)
A-gate recall: 93%
A-gate precision: 88%
False pos.: 2
Calibration: 71%
Differentiation: 0.29
Wall time: 6m 43s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. 39/39 calls successful.

Gemini 3.1 Flash Lite | Jittery | api
F2: 92 (1 missed)
A-gate recall: 93%
A-gate precision: 88%
False pos.: 2
Calibration: 71%
Differentiation: 0.43
Wall time: 2m 34s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. 39/39 calls successful (1 dimension name retry).

Claude Sonnet 4.6 | Jittery | subagent
F2: 89 (1 missed)
A-gate recall: 92%
A-gate precision: 79%
False pos.: 3
Calibration: 39%
Differentiation: 0.62
Wall time: n/a
Notes: 18 isolated subagent evaluations via Claude Code Agent tool (no cross-scenario anchoring)

Claude Opus 4.6 | Even-keeled | manual
F2: 89 (2 missed)
A-gate recall: 87%
A-gate precision: 100%
False pos.: 0
Calibration: 90%
Differentiation: 0.26
Wall time: n/a
Notes: Single-pass expert analysis with full document context

Owl Alpha (stealth) | Even-keeled | api
F2: 89 (2 missed)
A-gate recall: 87%
A-gate precision: 100%
False pos.: 0
Calibration: 75%
Differentiation: 0.19
Wall time: 16m 10s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed

MiniMax M2.7 | Noisy | api
F2: 87 (2 missed)
A-gate recall: 87%
A-gate precision: 87%
False pos.: 2
Calibration: 75%
Differentiation: 0.36
Wall time: 20m 27s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. Required max_tokens bump to 4096.

Grok 4.1 Fast | Noisy | api
F2: 87 (2 missed)
A-gate recall: 87%
A-gate precision: 87%
False pos.: 2
Calibration: 67%
Differentiation: 0.43
Wall time: 8m 23s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios

Baidu CoBuddy | Sleepy | api
F2: 83 (3 missed)
A-gate recall: 80%
A-gate precision: 100%
False pos.: 0
Calibration: 68%
Differentiation: 0.50
Wall time: 22m 42s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed

DeepSeek v3.2 | Sleepy | api
F2: 82 (3 missed)
A-gate recall: 80%
A-gate precision: 92%
False pos.: 1
Calibration: 61%
Differentiation: 0.43
Wall time: 21m 17s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. 1 eval failed (missing graceful_degradation).

MiMo v2.5 | Noisy | api
F2: 80 (3 missed)
A-gate recall: 80%
A-gate precision: 80%
False pos.: 3
Calibration: 75%
Differentiation: 0.48
Wall time: 55m 44s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed.

Claude Haiku 4.5 (api) | Noisy | api
F2: 79 (3 missed)
A-gate recall: 80%
A-gate precision: 75%
False pos.: 4
Calibration: 68%
Differentiation: 0.24
Wall time: 6m 9s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. 39/39 calls successful. Retested after the subagent run showed 8% F2; the dimension name compliance issue was subagent-specific.

DeepSeek V4 Flash | Noisy | api
F2: 79 (3 missed)
A-gate recall: 80%
A-gate precision: 75%
False pos.: 4
Calibration: 63%
Differentiation: 0.50
Wall time: 40m 11s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios. 38/39 calls successful (1 timeout retry).

InclusionAI Ring 2.6 1T | Sleepy | api
F2: 75 (4 missed)
A-gate recall: 73%
A-gate precision: 85%
False pos.: 2
Calibration: 70%
Differentiation: 0.40
Wall time: 15m 52s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios, 39/39 calls completed

Hunter Alpha (1T, stealth) | Noisy | api
F2: 74 (4 missed)
A-gate recall: 73%
A-gate precision: 79%
False pos.: 3
Calibration: 43%
Differentiation: 0.64
Wall time: 17m 14s
Notes: OpenRouter API via lab-01 pipeline

Poolside Laguna XS 2 | Noisy | api
F2: 73 (4 missed)
A-gate recall: 73%
A-gate precision: 73%
False pos.: 4
Calibration: 47%
Differentiation: 0.57
Wall time: 2m 6s
Notes: OpenRouter API via lab-01 pipeline, 18/18 calls completed

Qwen3.6 Plus | Sleepy | api
F2: 70 (5 missed)
A-gate recall: 67%
A-gate precision: 91%
False pos.: 1
Calibration: 63%
Differentiation: 0.67
Wall time: 51m 22s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios; wall time inflated by free-tier rate limiting and not directly comparable to other models

Healer Alpha (omni, stealth) | Sleepy | api
F2: 62 (6 missed)
A-gate recall: 60%
A-gate precision: 75%
False pos.: 3
Calibration: 47%
Differentiation: 0.60
Wall time: 6m 49s
Notes: OpenRouter API via lab-01 pipeline

GPT-5.4 Nano | Sleepy | api
F2: 59 (7 missed)
A-gate recall: 53%
A-gate precision: 100%
False pos.: 0
Calibration: 58%
Differentiation: 0.52
Wall time: 1m 42s
Notes: OpenRouter API via lab-01 pipeline, structured prompts, 13 scenarios

Arcee Trinity (free) | Sleepy | baseline | api
F2: 57 (7 missed)
A-gate recall: 53%
A-gate precision: 80%
False pos.: 2
Calibration: 49%
Differentiation: 0.69
Wall time: 4m 21s
Notes: OpenRouter API via lab-01 pipeline

Gemma 4 26B A4B | Sleepy | api
F2: 45 (9 missed)
A-gate recall: 40%
A-gate precision: 86%
False pos.: 1
Calibration: 59%
Differentiation: 0.50
Wall time: 57m 16s
Notes: OpenRouter API via lab-01 pipeline, 8/18 calls completed (heavy rate limiting)

Nvidia Nemotron 3 Nano Omni 30B | Sleepy | api
F2: 37 (10 missed)
A-gate recall: 33%
A-gate precision: 71%
False pos.: 2
Calibration: 27%
Differentiation: 0.62
Wall time: 2m 18s
Notes: OpenRouter API via lab-01 pipeline, 17/18 calls completed

Claude Haiku 4.5 | Broken | subagent
F2: 8 (14 missed)
A-gate recall: 7%
A-gate precision: 50%
False pos.: 1
Calibration: 6%
Differentiation: 0.10
Wall time: n/a
Notes: 18 isolated subagent evaluations via Claude Code Agent tool. Used wrong dimension names in 14/18 responses, with 1 refusal and 1 cheat (read source files). Scored with non-compliant outputs treated as all-missed.

How to read this leaderboard

F2 — the ranking score

A single deployment-safety score combining recall and precision, weighting a missed gate 4× as heavily as a false alarm. Recall alone saturates (many models catch every gate); F2 is what actually separates them. Each row lists its recall, precision, and error counts.
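As a rough illustration of how such a score behaves, here is a minimal sketch of the standard F-beta formula with beta = 2, computed from raw counts. It assumes the leaderboard uses the textbook definition; the function name `f_beta` and the example counts are illustrative, with the counts inferred from the GLM 5.2 and Gemini 2.5 Flash Lite rows (93% recall with 1 miss suggests 14 of 15 gates caught), not taken from the benchmark's source.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from raw counts; with beta=2 each miss costs 4x a false alarm."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Guard against the degenerate case where nothing fired and nothing was expected.
    return (1 + b2) * tp / denom if denom else 0.0


# Assumed counts: 15 gates total, inferred from the reported recall percentages.
glm = f_beta(tp=14, fp=0, fn=1)      # 1 missed, 0 false positives
gemini = f_beta(tp=15, fp=1, fn=0)   # no misses, 1 false positive

print(round(glm * 100), round(gemini * 100))  # 95 99
```

Under these assumed counts, a single miss costs about five points while a single false alarm costs about one, which is the asymmetry the score is designed to express.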

Missed gates — the cardinal error

A hard A-gate (Reg=A or Blast=A) that failed to fire lets a real risk through. A false alarm only adds review — so this is the number to watch after F2.
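The gate rule above can be sketched as a small check. This is a hypothetical illustration only: the dictionary layout, the function name `score_scenario`, the non-A level "B", and the choice to count gates per dimension rather than per scenario are all assumptions, and the real pipeline may differ. Only the dimension names `Reg` and `Blast` and the level `A` come from the text.

```python
from typing import Mapping, Tuple

# Dimension names as written in the text; everything else here is assumed.
HARD_GATE_DIMS = ("Reg", "Blast")


def score_scenario(
    reference: Mapping[str, str], predicted: Mapping[str, str]
) -> Tuple[int, int, int]:
    """Return (caught, missed, false_alarms) for one scenario."""
    caught = missed = false_alarms = 0
    for dim in HARD_GATE_DIMS:
        ref_fires = reference.get(dim) == "A"
        pred_fires = predicted.get(dim) == "A"
        if ref_fires and pred_fires:
            caught += 1
        elif ref_fires:
            missed += 1          # the cardinal error: a real risk gets through
        elif pred_fires:
            false_alarms += 1    # only adds review
    return caught, missed, false_alarms


reference = {"Reg": "A", "Blast": "B"}   # human-authored fingerprint (made up)
predicted = {"Reg": "B", "Blast": "A"}   # model output (made up)
print(score_scenario(reference, predicted))  # (0, 1, 1)
```

Summing these tuples across scenarios would give the miss and false-positive counts shown in each row, if the benchmark counts gates this way.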

Bias — the failure-mode label

How a model gets things wrong, not how often, so it is orthogonal to rank. An even-keeled model can still miss gates and sit below a jittery one; the label describes error character, while F2 does the ranking. Only sleepy (under-flags risk even when precision looks perfect; the profile to fear most) and broken are colored as warnings.

Labels: Even-keeled, Jittery, Noisy, Sleepy, Broken

Updated 2026-07-23

Leaderboard

See which AI agents catch dangerous risk gates and which ones miss them — scored on the metrics that matter for deployment.

Enterprise Risk

Know what to ask before you deploy. Six scenarios covering the judgment calls agents make when they act on your behalf.

Methodology

Open source, reproducible, and scored against human-authored reference fingerprints. Run the eval yourself and verify the results.

Evaluations active

Live from ara-eval
