A MatterSec Labs Benchmark

The best model depends
on who is asking

SecLens evaluates LLMs on real-world vulnerability detection through five stakeholder lenses. Decision Scores diverge by up to 31 points for the same model.

406
CVE Tasks
12
Models
35
Dimensions
5
Roles
31pt
Divergence
// Key Finding

Qwen3-Coder earns an A for Head of Engineering but a D for CISO. Claude Haiku 4.5, ranked 8th on the leaderboard, scores 2nd for CISO. No single model dominates — six different models lead at least one of 8 vulnerability categories.

Choose your lens

Select a stakeholder role to see how model rankings shift. Same evaluation data, different priorities.

Grades: A ≥ 75 · B ≥ 60 · C ≥ 50 · D ≥ 40 · F < 40
Aggregate score across all 35 dimensions with equal weighting.
# | Model | Score | Grade | LB | % vs LB

No single model dominates

F1 scores by model and OWASP-aligned vulnerability category. Six different models lead at least one category.

Divergence and cost

Models with conservative strategies earn top grades for Engineering but fail for CISO. Spending more does not guarantee better results.

Role Divergence Index

Max - min score across 5 roles. Higher = more stakeholder-dependent.
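As a minimal sketch, the index is just the spread of one model's Decision Scores across the five role profiles. The role names and score values below are hypothetical, chosen only to reproduce the 31-point headline divergence:

```python
def role_divergence_index(scores_by_role: dict[str, float]) -> float:
    """Spread of one model's Decision Scores across stakeholder roles:
    maximum minus minimum. Higher means more stakeholder-dependent."""
    values = scores_by_role.values()
    return max(values) - min(values)


# Hypothetical Decision Scores for a single model across five roles
# (only "CISO" and "Head of Engineering" are named on this page):
scores = {
    "CISO": 48.0,
    "Head of Engineering": 79.0,
    "AppSec Lead": 62.0,
    "Compliance": 55.0,
    "Developer": 70.0,
}
print(role_divergence_index(scores))  # 31.0
```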

Cost vs. Quality

Cost per task vs. leaderboard score. 8 models with cost tracking.

Different priorities, different outcomes

Each role weights 7 dimension categories differently. The same 35 dimensions, filtered through distinct organizational needs.

Real-world CVE tasks

Sourced from confirmed CVEs in open-source projects across 10 languages and 8 OWASP-aligned categories.

406
Total Tasks
203 true positive + 203 post-patch
93
Source Projects
Open-source repos with confirmed CVEs
10
Languages
Python, JS, Go, Ruby, Rust, Java, PHP, C, C++, C#
8
Categories
OWASP Top 10:2021 aligned
4
Severity Levels
Critical (25), High (74), Medium (83), Low (21)
35
Dimensions
Detection, Coverage, Reasoning, Efficiency, Tool-Use, Risk, Robustness

From CVE tasks to role-specific grades

Step 01

Evaluate

Models tested on 406 CVE tasks in two layers: Code-in-Prompt (single-turn reasoning) and Tool-Use (sandboxed codebase navigation).

Step 02

Measure

Each task scored on verdict (1pt), CWE classification (+1pt), and location accuracy (+1pt IoU). 35 aggregate dimensions computed across 7 categories.
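The per-task rubric can be sketched as follows. The gating of CWE and location credit on a correct verdict, and the proportional IoU partial credit, are assumptions; the page only states the point values:

```python
def score_task(verdict_correct: bool, cwe_correct: bool, iou: float) -> float:
    """Per-task score under the rubric above: a correct verdict earns 1
    point, correct CWE classification adds 1, and location accuracy adds
    up to 1 point scaled by IoU overlap with the ground-truth lines.
    Gating CWE/location credit on the verdict is an assumption."""
    score = 0.0
    if verdict_correct:
        score += 1.0
        if cwe_correct:
            score += 1.0
        score += max(0.0, min(1.0, iou))  # clamp IoU into [0, 1]
    return score
```

A fully correct task with perfect localization thus scores 3.0; a wrong verdict scores 0 regardless of the rest.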

Step 03

Normalize

Dimensions normalized to [0,1] using four strategies: ratio, MCC, lower-is-better, higher-is-better. Fixed reference caps eliminate cohort artifacts.
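A sketch of the four strategies, with illustrative formulas (the page names the strategies but not their exact math). The fixed `cap` argument stands in for the fixed reference cap, so a model's normalized value does not shift when the evaluated cohort changes:

```python
def normalize(value: float, strategy: str, cap: float = 1.0) -> float:
    """Map a raw dimension value into [0, 1] under one of the four
    named strategies. Formulas are illustrative assumptions."""
    if strategy == "ratio":
        x = value / cap                  # fraction of a fixed maximum
    elif strategy == "mcc":
        x = (value + 1.0) / 2.0          # MCC lies in [-1, 1]; shift/scale
    elif strategy == "higher_is_better":
        x = value / cap                  # linear against the fixed cap
    elif strategy == "lower_is_better":
        x = 1.0 - value / cap            # invert against the fixed cap
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(0.0, min(1.0, x))         # clamp to [0, 1]
```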

Step 04

Score per Role

Five YAML weight profiles select 12-16 dimensions each. Decision Score = weighted sum / available weight × 100, yielding grades A through F.
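The formula above can be sketched directly; the grade cutoffs match the bands shown earlier on the page (A ≥ 75 down to F < 40). Treating dimensions absent from a model's results (e.g. no cost data) as dropping out of both numerator and denominator is how "available weight" is interpreted here, which is an assumption:

```python
def decision_score(dims: dict[str, float], weights: dict[str, float]) -> float:
    """Decision Score = weighted sum / available weight x 100.
    `dims` holds normalized dimension values in [0, 1]; `weights` is one
    role's profile. Weighted dimensions missing from `dims` drop out of
    both numerator and denominator ("available weight")."""
    available = {k: w for k, w in weights.items() if k in dims}
    total_weight = sum(available.values())
    if total_weight == 0:
        return 0.0
    return sum(dims[k] * w for k, w in available.items()) / total_weight * 100


def grade(score: float) -> str:
    """Map a Decision Score to the letter bands used on this page."""
    for letter, cutoff in [("A", 75), ("B", 60), ("C", 50), ("D", 40)]:
        if score >= cutoff:
            return letter
    return "F"
```

For example, a role weighting two available dimensions at 0.8 and 0.6 with weights 2 and 1 (a third weighted dimension unavailable) yields (1.6 + 0.6) / 3 × 100 ≈ 73.3, a B.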

OWASP-aligned coverage

Category                  | Tasks | OWASP    | Leader (F1)            | Worst (F1)
Broken Access Control     | 82    | A01:2021 | Kimi K2.5 (0.667)      | Qwen3-Coder (0.128)
Cryptographic Failures    | 64    | A02:2021 | Gemini 3 Flash (0.676) | Qwen3-Coder (0.118)
Injection                 | 62    | A03:2021 | Gemini 3.1 Pro (0.632) | Qwen3-Coder (0.062)
Improper Input Validation | 58    | Extended | Haiku 4.5 (0.675)      | Qwen3-Coder (0.125)
SSRF                      | 46    | A10:2021 | Sonnet 4.6 (0.690)     | Qwen3-Coder (0.512)
Authentication Failures   | 38    | A07:2021 | Kimi K2.5 (0.585)      | Opus 4.6 (0.000)
Data Integrity Failures   | 36    | A08:2021 | Gemini 3 Flash (0.680) | Qwen3-Coder (0.200)
Memory Safety             | 20    | Extended | Haiku 4.5 (0.690)      | Qwen3-Coder (0.308)