TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

2604.21840

Haolin Zhang, William Reber, Yuxuan Zhang, Guofei Gu, Jeff Huang

cs.CR, cs.AI

TLDR

TraceScope is a decoupled pipeline for interactive URL triage, using sandboxed agents to detect sophisticated phishing that evades current classifiers.

Key contributions

  • Decoupled pipeline separates sandboxed evidence collection from adjudication for safe, interactive URL triage.
  • Operator agent navigates pages in a real browser, creating immutable evidence bundles.
  • Adjudicator agent verifies MITRE ATT&CK checklist, generating audit-ready reports.
  • Achieves 0.94 precision and 0.78 recall on 708 reachable URLs, and detects sophisticated real-world phishing that existing defenses miss.
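
The operator/adjudicator split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `EvidenceBundle` schema, the `freeze_session` and `adjudicate` functions, the artifact names, the two checklist items, and the verdict threshold are all hypothetical stand-ins. The placeholder checks are not real MITRE ATT&CK entries, and no real browser is driven here. The sketch only shows the shape of the idea: evidence is frozen once, then queried piece by piece.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping
import hashlib
import json


@dataclass(frozen=True)
class EvidenceBundle:
    """Read-only record of one browsing session (hypothetical schema)."""
    url: str
    artifacts: Mapping[str, str]  # artifact name -> captured content
    digest: str                   # SHA-256 over the artifacts, for tamper checks

    def query(self, name: str) -> str:
        # The adjudicator pulls one artifact at a time instead of
        # loading the whole session into its context.
        return self.artifacts.get(name, "")


def freeze_session(url: str, captured: dict) -> EvidenceBundle:
    """Operator side: freeze whatever was captured into an immutable bundle."""
    payload = json.dumps(captured, sort_keys=True).encode("utf-8")
    return EvidenceBundle(
        url=url,
        artifacts=MappingProxyType(dict(captured)),  # read-only view of a copy
        digest=hashlib.sha256(payload).hexdigest(),
    )


# Each checklist item is a named predicate over the bundle.
# These two items are illustrative placeholders only.
Check = Callable[[EvidenceBundle], bool]

CHECKLIST = {
    "credential_form_present": lambda b: 'type="password"' in b.query("final_dom"),
    "interaction_gate_seen": lambda b: "slider" in b.query("action_log"),
}


def adjudicate(bundle: EvidenceBundle, threshold: int = 2) -> dict:
    """Adjudicator side: evaluate each item, then emit a small report."""
    findings = {name: check(bundle) for name, check in CHECKLIST.items()}
    # Assumed decision rule: flag as phishing when enough items hold.
    verdict = "phishing" if sum(findings.values()) >= threshold else "benign"
    return {
        "url": bundle.url,
        "digest": bundle.digest,
        "findings": findings,
        "verdict": verdict,
    }
```

Because the adjudicator only ever sees the frozen bundle, it never touches the live page, and a reviewer can re-run the same checks against the same digest later.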

Why it matters

Modern phishing increasingly evades snapshot-based classifiers through interaction gates and delayed rendering, so triage requires interactive analysis. TraceScope automates that analysis inside a sandbox, and its audit-ready reports with extracted IOCs support forensic review.

Original Abstract

Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale. To prevent the observer effect and ensure safety, a sandboxed operator agent drives a real GUI browser guided by visual motivation to elicit page behavior, freezing the session into an immutable evidence bundle. Separately, an adjudicator agent circumvents LLM context limitations by querying evidence on demand to verify a MITRE ATT&CK checklist, and generates an audit-ready report with extracted indicators of compromise (IOCs) and a final verdict. Evaluated on 708 reachable URLs from existing datasets (241 verified phishing from PhishTank and 467 benign from Tranco-derived crawling), TraceScope achieves 0.94 precision and 0.78 recall, substantially improving recall over three prior visual/reference-based classifiers while producing reproducible, analyst-grade evidence suitable for review. More importantly, we manually curated a dataset of real-world phishing emails to evaluate our system in a practical setting. Our evaluation reveals that TraceScope demonstrates superior performance in a real-world scenario as well, successfully detecting sophisticated phishing attempts that current state-of-the-art defenses fail to identify.
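
To make the reported metrics concrete, the following back-of-the-envelope sketch reconstructs approximate confusion counts from the figures in the abstract. The counts are not reported in the source. They are derived under the assumption that phishing is the positive class and that precision and recall were computed over all 708 URLs (241 phishing, 467 benign). The function name `precision_recall` is illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)


PHISHING_TOTAL = 241       # verified phishing URLs, from the abstract
REPORTED_PRECISION = 0.94  # from the abstract
REPORTED_RECALL = 0.78     # from the abstract

# Assumption: recall was measured over all 241 phishing URLs.
tp = round(REPORTED_RECALL * PHISHING_TOTAL)   # phishing URLs caught (approx.)
fn = PHISHING_TOTAL - tp                       # phishing URLs missed (approx.)

# Rearranging precision = TP/(TP+FP) gives FP = TP * (1/precision - 1).
fp = round(tp * (1 / REPORTED_PRECISION - 1))  # benign URLs flagged (approx.)

precision, recall = precision_recall(tp, fp, fn)
print(f"TP~{tp}, FN~{fn}, FP~{fp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under these assumptions, the estimate is roughly 188 phishing pages caught, 53 missed, and about 12 of the 467 benign pages flagged. The pipeline rarely raises false alarms but still misses roughly one phishing page in five.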
