Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
TLDR
This paper introduces an LLM-based pipeline for taxonomy-agnostic PII annotation in HTTP traffic, addressing data scarcity and evolving privacy definitions.
Key contributions
- Proposes a multi-stage LLM pipeline for PII detection and value extraction in HTTP traffic.
- Enables taxonomy-agnostic PII annotation by supplying the taxonomy at runtime rather than coupling the detector to a fixed label set.
- Introduces an LLM-based synthetic HTTP traffic generator for controlled, privacy-safe evaluation.
- Reports accurate PII type detection and value extraction for concrete PII taxonomies, evaluated across three taxonomies that differ in PII domain and granularity.
Why it matters
Current learning-based PII leakage detectors are limited by scarce, manually labelled traffic and by fixed label taxonomies, which makes them hard to transfer across domains. This work offers a flexible, LLM-driven approach that adapts to evolving privacy definitions and reduces reliance on sensitive real-user captures. It also provides a way to create labelled data for automated privacy audits as taxonomies change.
Original Abstract
Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.
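The four stages named in the abstract (deterministic pre-processing, label-level classification, instance-level value annotation, output validation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `preprocess` and `annotate`, the flattening scheme, the verbatim-match validation rule, and the two stand-in model functions `fake_classify` and `fake_extract` are all hypothetical. In a real setup the two callables would wrap LLM prompts that receive the taxonomy at runtime.

```python
import json
from typing import Callable
from urllib.parse import parse_qsl

Fields = dict[str, str]


def preprocess(body: str) -> Fields:
    """Stage 1 (deterministic): flatten a JSON or form-encoded body into key/value pairs."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON: treat the body as application/x-www-form-urlencoded.
        return dict(parse_qsl(body))

    flat: Fields = {}

    def walk(node, prefix: str = "") -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{prefix}[{index}]")
        else:
            flat[prefix] = str(node)

    walk(parsed)
    return flat


def annotate(
    body: str,
    taxonomy: list[str],
    classify: Callable[[list[str], Fields], list[str]],
    extract: Callable[[str, Fields], list[str]],
) -> dict[str, list[str]]:
    """Run the four stages; `classify` and `extract` stand in for LLM calls."""
    fields = preprocess(body)
    # Stage 2: label-level classification, restricted to labels the taxonomy defines.
    labels = [label for label in classify(taxonomy, fields) if label in taxonomy]
    annotations: dict[str, list[str]] = {}
    for label in labels:
        # Stage 3: targeted value annotation, one call per detected label.
        candidates = extract(label, fields)
        # Stage 4 (assumed rule): keep only values that occur verbatim in the body fields.
        kept = [value for value in candidates if value in fields.values()]
        if kept:
            annotations[label] = kept
    return annotations


# Hypothetical stand-ins for the model: keyword matching plus deliberately bad output,
# so the validation steps have something to filter out.
def fake_classify(taxonomy: list[str], fields: Fields) -> list[str]:
    hits = [label for label in taxonomy if any(label in key.lower() for key in fields)]
    return hits + ["not_in_taxonomy"]


def fake_extract(label: str, fields: Fields) -> list[str]:
    return [v for k, v in fields.items() if label in k.lower()] + ["hallucinated"]


body = '{"user": {"email": "a@example.com", "age": 30}, "device_id": "xyz"}'
taxonomy = ["email", "device_id", "phone"]  # supplied at call time, not baked in
result = annotate(body, taxonomy, fake_classify, fake_extract)
print(result)
```

Because the taxonomy is an ordinary argument, swapping in a different label set needs no change to the pipeline code, which is the property the abstract calls taxonomy-agnostic. The validation stage here discards both the out-of-taxonomy label and the value that never appeared in the body.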