ArXiv TLDR

From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings

arXiv: 2605.00225

Christiaan M. Geldenhuys, Thomas R. Niesler

eess.AS · cs.LG · cs.SD · q-bio.QM

TLDR

Pretrained out-of-species acoustic embeddings effectively classify elephant calls, nearly matching supervised methods without fine-tuning.

Key contributions

  • Out-of-species/out-of-domain pretrained acoustic embeddings classify elephant calls effectively without any fine-tuning of the embedding model.
  • Perch 2.0 achieved the best performance (AUCs of 0.849 on African bush elephant and 0.936 on Asian elephant calls), within 2.2% of an end-to-end supervised system.
  • Intermediate layers of pretrained transformer encoders (e.g., layer 2 of wav2vec2.0 and HuBERT) outperform final-layer outputs.
  • Truncating the encoder at layer 2 preserves classification performance while keeping only ~10% of the parameters, making it well suited to on-device processing.
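The core recipe behind these results — keep the embedding network frozen, train only a lightweight classifier on top — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's code: the `frozen_embed` function is a hypothetical stand-in (a fixed random projection) for a real pretrained model such as Perch 2.0 or a truncated wav2vec2.0, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_embed(audio_batch):
    # Hypothetical stand-in for a frozen pretrained embedding model:
    # a fixed random projection to 32 dimensions. In the paper, this role
    # is played by networks like Perch 2.0, whose weights are never updated.
    W = np.random.default_rng(42).normal(size=(audio_batch.shape[1], 32))
    return audio_batch @ W

# Synthetic "audio" features; class-1 examples carry a small mean shift
# so the downstream probe has a signal to find.
X_audio = rng.normal(size=(200, 128))
y = (rng.random(200) < 0.5).astype(float)
X_audio[y == 1] += 0.3

E = frozen_embed(X_audio)            # embeddings are computed once and fixed
E = (E - E.mean(0)) / E.std(0)       # standardise features

# Lightweight downstream classifier: logistic regression via gradient descent.
w, b = np.zeros(E.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(E @ w + b)))
    g = p - y
    w -= 0.1 * (E.T @ g) / len(y)
    b -= 0.1 * g.mean()

# AUC as the fraction of (positive, negative) pairs ranked correctly.
scores = E @ w + b
pos, neg = scores[y == 1], scores[y == 0]
auc = (pos[:, None] > neg[None, :]).mean()
print(f"probe AUC: {auc:.2f}")
```

Because only `w` and `b` are trained, the expensive embedding network needs no gradients or fine-tuning — the property that makes this setup practical when annotated data are scarce.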

Why it matters

Annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised methods prone to overfitting and poor generalisation under domain shift. This paper offers a practical alternative: readily available out-of-species embeddings, paired with lightweight classifiers, classify elephant calls almost as well as end-to-end supervised networks. That makes the approach especially attractive for bioacoustic research and for on-device applications where data and compute are limited.

Original Abstract

We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2 % of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10 % of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
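The abstract's ~10% parameter figure for a layer-2 truncation can be sanity-checked with a back-of-the-envelope count. The sketch below assumes a wav2vec2.0 LARGE-style configuration (24 transformer layers, hidden size 1024, feed-forward size 4096) and a rough CNN feature-extractor size; these numbers are illustrative assumptions, not taken from the paper.

```python
# Rough parameter count for a wav2vec2.0-style encoder, illustrating why
# truncating after layer 2 keeps only about a tenth of the parameters.
hidden, ffn, layers = 1024, 4096, 24            # assumed LARGE-style config
per_layer = 4 * hidden**2 + 2 * hidden * ffn    # attention + feed-forward weights
feature_extractor = 4_500_000                   # rough CNN front-end size (assumption)

full = feature_extractor + layers * per_layer
truncated = feature_extractor + 2 * per_layer
ratio = truncated / full
print(f"truncated/full parameter ratio: {ratio:.2f}")
```

Under these assumptions the truncated model retains roughly 10% of the full network's parameters, consistent with the abstract's claim.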
