ArXiv TLDR

Towards Open World Sound Event Detection

arXiv:2605.03934

P. H. Hai, L. T. Minh, L. H. Son

cs.SD, cs.AI

TLDR

This paper introduces Open-World Sound Event Detection (OW-SED) and proposes WOOT, a 1D deformable transformer framework for detecting known sound events, identifying unseen ones, and incrementally learning from them.

Key contributions

  • Introduces the Open-World Sound Event Detection (OW-SED) paradigm, in which models must detect known events, identify unseen ones, and incrementally learn novel events.
  • Proposes a 1D deformable architecture that uses deformable attention to adaptively focus on salient temporal regions.
  • Designs the WOOT framework, combining feature disentanglement, one-to-many matching, and a diversity loss.
  • Demonstrates superior performance in open-world scenarios and competitive results in closed-world settings.
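The core mechanism behind the contributions above, 1D deformable attention, can be sketched roughly as follows. Each query is tied to a reference time, predicts a few fractional temporal offsets plus attention weights, and aggregates linearly interpolated features at those sampled times. The function name, shapes, and pure-Python style are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def deformable_attn_1d(query, ref_t, features, offset_w, attn_w):
    """Single-head 1D deformable attention for one query (illustrative sketch).

    query:    list[float], length D          -- query embedding
    ref_t:    float in [0, 1]                -- the query's reference time
    features: list[list[float]], shape (T,D) -- temporal feature sequence
    offset_w, attn_w: (D, P) weight matrices -- predict P offsets / P weights
    """
    T, D = len(features), len(query)
    P = len(offset_w[0])
    # Predict P fractional temporal offsets and attention logits from the query.
    offsets = [sum(query[d] * offset_w[d][p] for d in range(D)) for p in range(P)]
    weights = softmax([sum(query[d] * attn_w[d][p] for d in range(D)) for p in range(P)])

    out = [0.0] * D
    for p in range(P):
        # Sample the feature sequence at the clipped fractional position.
        t = min(max((ref_t + offsets[p]) * (T - 1), 0.0), T - 1.0)
        lo, hi, frac = int(t), min(int(t) + 1, T - 1), t - int(t)
        for d in range(D):
            sample = (1 - frac) * features[lo][d] + frac * features[hi][d]
            out[d] += weights[p] * sample
    return out
```

Because the sampling locations are learned per query, attention concentrates on a handful of salient time steps rather than attending densely over the whole sequence, which suits short, overlapping sound events.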

Why it matters

Current SED systems operate under a closed-world assumption and struggle with novel sounds in real-world settings. By introducing the OW-SED paradigm and the WOOT framework, this paper enables models to detect and incrementally learn new events, advancing audio understanding for practical applications such as surveillance and smart cities.

Original Abstract

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
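The abstract names a diversity loss but does not specify its form. As an illustrative guess at its spirit only, a common way to encourage representation diversity among the multiple queries that one-to-many matching assigns to the same event is to penalize their mean pairwise cosine similarity:

```python
import math

def diversity_loss(embs):
    """Toy diversity regularizer (an assumption about the loss's spirit,
    not the paper's exact formulation): penalize mean pairwise cosine
    similarity among embeddings matched to the same ground-truth event,
    pushing them toward distinct representations.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)

    n = len(embs)
    if n < 2:
        return 0.0  # nothing to diversify
    pairs = n * (n - 1) / 2
    total = sum(cos(embs[i], embs[j]) for i in range(n) for j in range(i + 1, n))
    return total / pairs
```

Minimizing such a term drives the matched embeddings apart (loss near 1 when they are identical, near 0 when orthogonal), which is one plausible way to make the one-to-many matches carry complementary information.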
