Identification and Anonymization of Named Entities in Unstructured Information Sources for Use in Social Engineering Detection
Carlos Jimeno Miguel, Raul Orduna, Francesco Zola
TLDR
This paper proposes a system to identify and anonymize named entities in unstructured data from Telegram for social engineering detection, ensuring GDPR compliance.
Key contributions
- Proposed a system to collect and process unstructured data (text, audio, images) from Telegram.
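- Evaluated speech-to-text (STT) models, with Parakeet performing best at audio transcription, and NER solutions including Microsoft Presidio and transformer-based models, with the proposed NER solutions achieving the highest F1-scores in detecting sensitive information.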
- Introduced anonymization metrics to preserve data coherence while protecting personal information.
- Enables creation of GDPR-compliant datasets for social engineering detection research.
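As a rough illustration of the anonymization idea in the list above, the sketch below replaces detected entities with typed, numbered placeholders so that repeated mentions stay linked after anonymization. It is not the authors' implementation: the regular expressions stand in for the NER models the paper evaluates, and the names `PATTERNS` and `anonymize` as well as the placeholder format are assumptions made for this example.

```python
import re

# Hypothetical detectors for illustration only; the paper evaluates NER
# solutions such as Microsoft Presidio and transformer-based models instead.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(text, mapping):
    """Replace each detected entity with a placeholder such as <EMAIL_1>.

    `mapping` stores (label, original value) -> placeholder, so the same
    value always receives the same placeholder, even across messages.
    """
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            key = (label, match.group(0))
            if key not in mapping:
                # Number placeholders per entity type, starting at 1.
                index = sum(1 for known in mapping if known[0] == label) + 1
                mapping[key] = f"<{label}_{index}>"
            return mapping[key]

        text = pattern.sub(replace, text)
    return text


mapping = {}
message = "Mail ana@example.com or call +34 600 123 456. Again: ana@example.com"
print(anonymize(message, mapping))
```

Reusing one `mapping` across a whole conversation keeps references consistent, which is one simple way to preserve coherence while hiding the underlying values.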
Why it matters
This paper addresses the challenge of using real-world data for cybercrime analysis without violating privacy regulations such as the GDPR. By pairing sensitive-entity detection with anonymization metrics that check whether the data remains structurally coherent, it supports research into social engineering within the current legal framework.
Original Abstract
This study addresses the challenge of creating datasets for cybercrime analysis while complying with the requirements of regulations such as the General Data Protection Regulation (GDPR) and Organic Law 10/1995 of the Penal Code. To this end, a system is proposed for collecting information from the Telegram platform, including text, audio, and images; the implementation of speech-to-text transcription models incorporating signal enhancement techniques; and the evaluation of different Named Entity Recognition (NER) solutions, including Microsoft Presidio and AI models designed using a transformer-based architecture. Experimental results indicate that Parakeet achieves the best performance in audio transcription, while the proposed NER solutions achieve the highest f1-score values in detecting sensitive information. In addition, anonymization metrics are presented that allow evaluation of the preservation of structural coherence in the data, while simultaneously guaranteeing the protection of personal information and supporting cybersecurity research within the current legal framework.
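For readers unfamiliar with how NER solutions are typically compared, the following sketch computes an entity-level F1-score from gold and predicted spans. The abstract does not state which F1 variant the authors use, so exact matching on `(start, end, label)` tuples is an assumption here, and the function name `entity_f1` and the sample spans are hypothetical.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 with exact match on (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up spans: the second prediction has the right offsets but wrong label.
gold = [(0, 5, "PER"), (10, 20, "LOC"), (25, 30, "PHONE")]
predicted = [(0, 5, "PER"), (10, 20, "ORG"), (25, 30, "PHONE")]
print(round(entity_f1(gold, predicted), 3))
```

Under exact matching, a span with the correct boundaries but the wrong label counts as both a false positive and a false negative, so it lowers precision and recall together.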