Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

April 9, 2026, arXiv:2604.08140

Longgang Zhang, Xiaowei Fu, Fuxiang Huang, Lei Zhang

cs.CR, cs.AI, cs.MM, cs.NI

TLDR

This paper introduces BGTD, a new benchmark, and mmTraffic, an LLM-based multimodal framework for explainable encrypted network traffic interpretation.

Key contributions

Proposes Byte-Grounded Traffic Description (BGTD), a new benchmark for multimodal, explainable traffic analysis.
Introduces mmTraffic, an end-to-end multimodal LLM framework for traffic-language representation.
Utilizes a jointly-optimized perception-cognition architecture to mitigate modality interference and hallucinations.
Generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports.
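
As a rough illustration of what "byte-grounded" evidence could look like, the sketch below pairs a raw payload with annotations that cite specific byte offsets, so each claim can be checked against the bytes themselves. The class names, field names, and the `tls_handshake` label are hypothetical and are not taken from the BGTD schema; only the TLS record header layout (content type in byte 0, version in bytes 1-2) is standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    offset: int      # start of the cited byte span in the payload
    expected: bytes  # bytes the annotation claims are found at that offset
    note: str        # human-readable explanation of the claim


@dataclass
class AnnotatedFlow:
    # Hypothetical record layout, NOT the actual BGTD format.
    payload: bytes
    label: str
    evidence: List[Evidence] = field(default_factory=list)


def unverified(flow: AnnotatedFlow) -> List[Evidence]:
    """Return the evidence items whose cited bytes do not match the payload."""
    return [
        e for e in flow.evidence
        if flow.payload[e.offset:e.offset + len(e.expected)] != e.expected
    ]


# First bytes of a TLS record: type 0x16 (handshake), version 0x0303.
flow = AnnotatedFlow(
    payload=bytes.fromhex("160303002a0100"),
    label="tls_handshake",  # illustrative label only
    evidence=[
        Evidence(0, b"\x16", "TLS record content type: handshake"),
        Evidence(1, b"\x03\x03", "TLS record version field 0x0303"),
    ],
)

# A claim that is not supported by the bytes is flagged, not trusted.
bad_claim = Evidence(0, b"\x17", "claims application data at byte 0")
flagged = unverified(AnnotatedFlow(flow.payload, flow.label, [bad_claim]))
```

Under this assumed layout, a generated report is auditable in the sense that any statement whose cited span fails the check can be rejected as ungrounded.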

Why it matters

Existing network traffic analysis methods are black boxes, lacking explainable reasoning and rich semantic understanding. This paper introduces a benchmark and an LLM-based framework that enables auditable, human-readable interpretation of encrypted traffic. This advances security analysis by providing verifiable evidence.

Original Abstract

Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) They fail to capture multidimensional semantics beyond unimodal sequence patterns. (2) Their black box property, i.e., providing only category labels, lacks an auditable reasoning process. We identify a key factor: existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, so they cannot support the generation of human-readable evidence reports. To address data scarcity, this paper proposes a Byte-Grounded Traffic Description (BGTD) benchmark for the first time, combining raw bytes with structured expert annotations. BGTD provides necessary behavioral features and verifiable chains of evidence for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end-to-end traffic-language representation framework (mmTraffic), a multimodal reasoning architecture bridging physical traffic encoding and semantic interpretation. In order to alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly-optimized perception-cognition architecture. By incorporating a perception-centered traffic encoder and a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports, while maintaining highly competitive classification accuracy compared to specialized unimodal models (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark
