ArXiv TLDR

SecLens-R: Role-Specific Evaluation of LLMs for Security Vulnerability Detection

arXiv: 2604.01637

Subho Halder, Siddharth Saxena, Kashinath Kadaba Shrish, Thiyagarajan M

cs.CR, cs.AI

TLDR

SecLens-R introduces a multi-stakeholder framework to evaluate LLMs for security vulnerability detection, revealing significant performance differences across roles.

Key contributions

  • Introduces SecLens-R, a multi-stakeholder evaluation framework for LLM-based vulnerability detection.
  • Defines 5 role-specific weighting profiles (e.g., CISO, Head of Engineering), each yielding a composite Decision Score (see the sketch after this list).
  • Evaluates 12 frontier LLMs on 406 tasks across 10 languages and 8 OWASP categories.
  • Reveals up to 31-point performance disparities for the same model across different stakeholder profiles.
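
A minimal sketch of how such a Decision Score could be computed, assuming each dimension is scored in [0, 1]. The function name, dimension names, profile weights, and the handling of the 20 points not covered by profile weights are all invented for illustration; this is not the paper's published code.

```python
def decision_score(dim_scores, profile_weights, common_score=0.0):
    """Composite Decision Score in [0, 100].

    dim_scores: per-dimension scores, each normalized to [0, 1].
    profile_weights: a role profile's dimension weights; per the paper,
        each profile selects 12-16 dimensions with weights summing to 80.
    common_score: assumed score in [0, 1] for the remaining 20 points,
        which the summary does not break down.
    """
    assert abs(sum(profile_weights.values()) - 80) < 1e-9, "weights must sum to 80"
    weighted = sum(w * dim_scores[d] for d, w in profile_weights.items())
    return weighted + 20.0 * common_score

# Toy model scores on four invented dimensions (real profiles use 12-16).
model = {"critical_recall": 0.4, "severity_coverage": 0.3,
         "false_positive_rate": 0.9, "cost_efficiency": 0.8}

# Two invented role profiles weighting the same dimensions differently.
ciso = {"critical_recall": 40, "severity_coverage": 30,
        "false_positive_rate": 5, "cost_efficiency": 5}
head_of_engineering = {"false_positive_rate": 40, "cost_efficiency": 25,
                       "critical_recall": 10, "severity_coverage": 5}

print(decision_score(model, ciso, common_score=0.5))                 # 43.5
print(decision_score(model, head_of_engineering, common_score=0.5))  # 71.5
```

Under these invented weights, the same model lands roughly 28 points apart across the two profiles, illustrating the kind of role-dependent disparity (up to 31 points) the paper reports.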

Why it matters

Current benchmarks for LLM-based vulnerability detection compress performance into a single metric, failing to capture diverse stakeholder needs. This paper shows that a model performing well for an engineering leader can be inadequate for a CISO. It provides a nuanced, role-aware evaluation method, SecLens-R, which is crucial for deploying LLMs responsibly in security.

Original Abstract

Existing benchmarks for LLM-based vulnerability detection compress model performance into a single metric, which fails to reflect the distinct priorities of different stakeholders. For example, a CISO may emphasize high recall of critical vulnerabilities, an engineering leader may prioritize minimizing false positives, and an AI officer may balance capability against cost. To address this limitation, we introduce SecLens-R, a multi-stakeholder evaluation framework structured around 35 shared dimensions grouped into 7 measurement categories. The framework defines five role-specific weighting profiles: CISO, Chief AI Officer, Security Researcher, Head of Engineering, and AI-as-Actor. Each profile selects 12 to 16 dimensions with weights summing to 80, yielding a composite Decision Score between 0 and 100. We apply SecLens-R to evaluate 12 frontier models on a dataset of 406 tasks derived from 93 open-source projects, covering 10 programming languages and 8 OWASP-aligned vulnerability categories. Evaluations are conducted across two settings: Code-in-Prompt (CIP) and Tool-Use (TU). Results show substantial variation across stakeholder perspectives, with Decision Scores differing by as much as 31 points for the same model. For instance, Qwen3-Coder achieves an A (76.3) under the Head of Engineering profile but a D (45.2) under the CISO profile, while GPT-5.4 shows a similar disparity. These findings demonstrate that vulnerability detection is inherently a multi-objective problem and that stakeholder-aware evaluation provides insights that single aggregated metrics obscure.
