Comment by enisberk

4 days ago

Hi d_burfoot, really appreciate you bringing that up! The idea of pre-training a big foundation model on our raw data using self-supervised learning (SSL) methods (kind of like how GPT emerged in NLP) is definitely something we've considered and experimented with using transformer architectures.

The main hurdle we've hit is honestly the scale of relevant data needed to train such large models from scratch effectively. While our dataset's ~19.5 years of total duration is massive for ecoacoustics, a significant portion of it is silence or ambient noise. This means the actual volume of distinct events or complex acoustic scenes is much lower than in the densely packed corpora typically used to train foundation speech or general audio models, so our effective dataset size is smaller in that context.
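A rough illustration of the "effective dataset size" point, not the pipeline used in this project: one simple way to estimate how much of a recording contains events is to flag frames whose energy stands well above the background. The function names, frame length, and threshold ratio below are assumptions picked for the toy example, and the signal is synthetic.

```python
import math
import random

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's RMS."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def active_fraction(samples, frame_len=100, threshold_ratio=3.0):
    """Fraction of frames whose RMS exceeds threshold_ratio times the median RMS.

    The median serves as a crude estimate of the background noise level,
    which works only when most frames are background (an assumption).
    """
    rms = frame_rms(samples, frame_len)
    median = sorted(rms)[len(rms) // 2]
    active = sum(1 for r in rms if r > threshold_ratio * median)
    return active / len(rms)

# Synthetic stand-in for a field recording: low-level noise with one
# loud 500-sample "call" in the middle. Not real data.
rng = random.Random(0)
signal = [rng.gauss(0.0, 0.01) for _ in range(10_000)]
for i in range(4_000, 4_500):
    signal[i] += math.sin(2 * math.pi * i / 20)

fraction = active_fraction(signal)
print(f"active fraction: {fraction:.2f}")
```

Multiplying a fraction like this by the total recording duration gives a back-of-the-envelope effective size. A fixed energy threshold would likely struggle with the non-stationary noise found in real field recordings, so this is only a sketch of the idea.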

We also tried leveraging existing pre-trained SSL models (like Wav2Vec 2.0 and HuBERT, which were built for speech), but the domain gap is substantial. As you can imagine, raw ecoacoustic field recordings are characterized by significant non-stationary noise, overlapping sounds, sparse events of interest mixed with lots of quiet or noise, huge diversity, and variation from microphones and weather.

This messes with the SSL pre-training tasks themselves. Predicting masked audio doesn't work as well when the surrounding context is just noise, and the data augmentations used in contrastive learning can sometimes accidentally remove the unique signatures of the animal calls we're trying to learn.
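To make the augmentation problem concrete, here is a toy simulation, assuming a SpecAugment-style time mask (one common augmentation, not necessarily the one used here). A clip is modeled as 100 frames with a single 3-frame call, and the mask zeroes a random 20-frame span. All sizes and names are hypothetical.

```python
import random

def time_mask(frames, mask_len, rng):
    """Return a copy of frames with one random contiguous span set to zero."""
    start = rng.randrange(0, len(frames) - mask_len + 1)
    masked = list(frames)
    for i in range(start, start + mask_len):
        masked[i] = 0.0
    return masked

def event_survives(masked, event_start, event_len):
    """True if at least one frame of the event is still non-zero."""
    return any(masked[i] != 0.0 for i in range(event_start, event_start + event_len))

# Toy clip: silence everywhere except a short call at frames 50..52.
n_frames, event_start, event_len, mask_len = 100, 50, 3, 20
clip = [0.0] * n_frames
for i in range(event_start, event_start + event_len):
    clip[i] = 1.0

rng = random.Random(0)
trials = 10_000
erased = sum(
    1
    for _ in range(trials)
    if not event_survives(time_mask(clip, mask_len, rng), event_start, event_len)
)
erased_rate = erased / trials
print(f"call fully erased in {erased_rate:.1%} of augmented views")
```

In this setup the call is completely masked whenever the span starts at any of 18 of the 81 possible positions, so roughly a fifth of the augmented views contain no trace of the call. A contrastive objective would then be asked to match a view with the call to a view without it, which is the failure mode described above.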

It's definitely an ongoing challenge in the field! People are trying different things, like initializing audio transformers with weights from models pre-trained on images (e.g., ViT adapted for spectrograms) to give them a head start. Finding the best way forward for large models in these specialized, data-constrained domains is still an open question. Thanks again for the suggestion, it really hits on a core challenge!

Do the recorders have overlapping detections?

  • If you’re asking whether multiple recorders were active at the same time, then yes: we had recorders at 98 different locations over four years, primarily during the summer months. However, these locations were far apart, so no two recorders captured exactly the same area.