Comment by enisberk
4 days ago
Finishing up my PhD thesis on low-resource audio classification for ecoacoustics. Our partners deployed 98 recorders in remote Arctic/sub-Arctic regions, collecting a massive dataset (~19.5 years of audio) to monitor wildlife and human noise.
Labeled data is the bottleneck, so my work focuses on getting good results with less data. Key parts:
- Created EDANSA [1], the first public dataset of its kind from these areas, using an improved active learning method (ensemble disagreement) to efficiently find rare sounds.
- Explored other low-resource ML: transfer learning, data valuation (using Shapley values), cross-modal learning (using satellite weather data to train audio models), and testing the reasoning abilities of MLLMs on audio (spoiler: they struggle!).
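For anyone curious what "ensemble disagreement" looks like in practice, here is a minimal sketch (not the actual EDANSA pipeline; all function names are illustrative): train a small ensemble, score each unlabeled clip by the vote entropy of the ensemble's predictions, and send the highest-entropy clips to annotators.

```python
import numpy as np

def vote_entropy(prob_stack):
    """Disagreement score per sample from an ensemble's predictions.

    prob_stack: array of shape (n_models, n_samples, n_classes)
    holding each model's predicted class probabilities.
    Returns the entropy of the ensemble's hard-vote distribution.
    """
    votes = prob_stack.argmax(axis=2)            # (n_models, n_samples)
    n_models, n_samples = votes.shape
    n_classes = prob_stack.shape[2]
    counts = np.zeros((n_samples, n_classes))
    for m in range(n_models):
        counts[np.arange(n_samples), votes[m]] += 1
    p = counts / n_models                        # vote distribution per sample
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    return ent

def select_for_labeling(prob_stack, k):
    """Pick the k unlabeled samples the ensemble disagrees on most."""
    return np.argsort(-vote_entropy(prob_stack))[:k]
```

Samples where all models agree get entropy 0 and are skipped; contested samples (where rare sounds often hide) bubble to the top of the labeling queue.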
Happy to discuss any part!
[1] https://scholar.google.com/citations?user=AH-sLEkAAAAJ&hl=en
Hi Enis, this seems like a very interesting project. My team and I are currently working on the non-stationarity of public physiological and earthquake seismic data, mainly based on time-frequency distributions, and the results are very promising.
Just wondering whether the raw data you've mentioned is available publicly so we can test our techniques on it, or whether it's only available through research collaborations. Either way, I'm very interested in the potential use of our techniques for polar research in the Arctic and/or Antarctica.
Hi teleforce, thanks! Your project sounds very interesting as well.
That actually reminds me, at one point, a researcher suggested looking into geophone or fiber optic Distributed Acoustic Sensing (DAS) data that oil companies sometimes collect in Alaska, potentially for tracking animal movements or impacts, but I never got the chance to follow up. Connecting seismic activity data (like yours) with potential effects on animal vocalizations or behaviour observed in acoustic recordings would be an interesting research direction!
Regarding data access:
Our labeled dataset (EDANSA, focused on specific sound events) is public here: https://zenodo.org/records/6824272. We will be releasing an updated version with more samples soon.
We are also actively working on releasing the raw, continuous audio recordings. These will eventually be published via the Arctic Data Center (arcticdata.io). If you'd like, feel free to send me an email (address should be in my profile), and I can ping you when that happens.
Separately, we have an open-source model (with updates coming) trained on EDANSA for predicting various animal sounds and human-generated noise. Let me know if you'd ever be interested in discussing whether running that model on other types of non-stationary sound data you might have access to could be useful or yield interesting comparisons.
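For context, running a clip-level classifier like that over long continuous recordings usually means sliding a fixed-length window across the file and collecting per-window predictions. A generic sketch (not the actual model's API; `model` here is any callable you supply):

```python
import numpy as np

def sliding_window_predict(audio, sr, model, win_s=10.0, hop_s=10.0):
    """Run a clip-level classifier over a long continuous recording.

    audio: 1-D waveform array; sr: sample rate in Hz.
    model: callable mapping a fixed-length clip to a prediction.
    Returns one prediction per window (non-overlapping when hop == win).
    """
    win, hop = int(sr * win_s), int(sr * hop_s)
    preds = []
    for start in range(0, len(audio) - win + 1, hop):
        preds.append(model(audio[start:start + win]))
    return np.array(preds)
```

Overlapping windows (hop < win) trade compute for smoother event boundaries; the per-window outputs can then be thresholded or smoothed into event detections.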
You should train a GPT on the raw data, and then figure out how to reuse the DNN for various other tasks you're interested in (e.g. one-shot learning, fine-tuning, etc). This data setting is exactly the situation that people faced in the NLP world before GPT. I would guess that some people from the frontier labs would be willing to help you, I doubt even your large dataset would cost very much for their massive GPU fleets to handle.
Hi d_burfoot, really appreciate you bringing that up! The idea of pre-training a big foundation model on our raw data using self-supervised learning (SSL) methods (kind of like how GPT emerged in NLP) is definitely something we've considered and experimented with using transformer architectures.
The main hurdle we've hit is honestly the scale of relevant data needed to train such large models from scratch effectively. While our ~19.5-year dataset duration is massive for ecoacoustics, a significant portion of it is silence or ambient noise. This means the actual volume of distinct events or complex acoustic scenes is much lower than in the densely packed corpora typically used to train foundational speech or general audio models, making our effective dataset size smaller in that context.
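To make the silence problem concrete: one common mitigation (a hypothetical sketch, not our exact pipeline) is to drop low-energy windows before pretraining, e.g. by comparing each frame's RMS energy against an estimated noise floor:

```python
import numpy as np

def select_active_frames(audio, sr, frame_s=1.0, db_above_floor=6.0):
    """Indices of frames whose RMS energy exceeds the noise floor by a margin.

    The noise floor is estimated as a low percentile of per-frame energy,
    so mostly-silent recordings keep only their acoustically active parts.
    """
    frame_len = int(sr * frame_s)
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12  # avoid log(0)
    rms_db = 20 * np.log10(rms)
    floor_db = np.percentile(rms_db, 10)               # noise-floor estimate
    return np.where(rms_db > floor_db + db_above_floor)[0]
```

Real field recordings need a more robust floor estimate (wind and rain raise energy without adding events), but even crude filtering like this shrinks the pretraining corpus toward the frames that actually carry signal.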
We also tried leveraging existing pre-trained SSL models (like Wav2Vec 2.0, HuBERT for speech), but the domain gap is substantial. As you can imagine, raw ecoacoustic field recordings are characterized by significant non-stationary noise, overlapping sounds, sparse events we care about mixed with lots of quiet/noise, huge diversity, and variations from mics/weather.
This messes with the SSL pre-training tasks themselves. Predicting masked audio doesn't work as well when the surrounding context is just noise, and the data augmentations used in contrastive learning can sometimes accidentally remove the unique signatures of the animal calls we're trying to learn.
It's definitely an ongoing challenge in the field! People are trying different things, like initializing audio transformers with weights pre-trained on image models (ViT adapted for spectrograms) to give them a head start. Finding the best way forward for large models in these specialized, data-constrained domains is still key. Thanks again for the suggestion, it really hits on a core challenge!
Do the recorders have overlapping detections?
oh man that's awesome. I have been working for quite some time on big taxonomy/classification models for field research, especially for my old research area (pollination stuff). the #1 capability that I want to build is audio input modality; it would just be so useful in the field-- not only for low-resource (audio-only) field sensors, but also as a supplemental modality for measuring activity out of the FoV of an image sensor.
but as you mention, labeled data is the bottleneck. eventually I'll be able to skirt around this by just capturing more video data myself and learning sound features from the video component, but I have a hard time imagining how I can get the global coverage that I have in visual datasets. I would give anything to trade half of my labeled image data for labeled audio data!
Hi Caleb, thanks for the kind words and enthusiasm! You're absolutely right, audio provides that crucial omnidirectional coverage that can supplement fixed field-of-view sensors like cameras. We actually collect images too and have explored fusion approaches, though they definitely come with their own set of challenges, as you can imagine.
On the labeled audio data front: our Arctic dataset (EDANSA, linked in my original post) is open source. We've actually updated it with more samples since the initial release, and getting the new version out is on my to-do list.
Polli.ai looks fantastic! It's genuinely exciting to see more people tackling the ecological monitoring challenge with hardware/software solutions. While I know the startup path in this space can be tough financially, the work is incredibly important for understanding and protecting biodiversity. Keep up the great work!
I'd love to turn my spectrogram tool into something more of a scientific tool for sound labelling and analysis. Do you use a spectrograph for your project?
Hey thePhytochemist, cool tool! Yes, spectrograms are fundamental for us. Audacity is the classic for quick looks. For systematic analysis and ML inputs, it's mostly programmatic generation via libraries like torchaudio or librosa. Spectrograms are a common ML input, though other representations are being explored.
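For a sense of what the programmatic route computes under the hood, here is a dependency-free sketch of a magnitude spectrogram (a short-time Fourier transform, the same quantity librosa's and torchaudio's spectrogram functions produce; parameter values are illustrative):

```python
import numpy as np

def spectrogram(y, n_fft=512, hop=256):
    """Magnitude spectrogram of waveform y via a windowed STFT.

    Slices y into overlapping frames, applies a Hann window, and takes
    the magnitude of each frame's real FFT.
    Returns an array of shape (n_fft // 2 + 1, n_frames): freq x time.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

In practice the libraries add padding modes, mel-scaling, and dB conversion on top, but the freq-by-time magnitude matrix above is the core object that gets rendered as the image you scroll through in a labelling tool.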
Enhancing frequenSee for scientific use (labelling/analysis) sounds like a good idea, but I'm not sure what is missing from the current tooling. What functionalities were you thinking of adding?
How do I download the sounds? Seems like a great resource for game developers and other artists.
Our labeled dataset (EDANSA, focused on specific sound events) is public here: https://zenodo.org/records/6824272. We will be releasing an updated version with more samples soon.
We are also actively working on releasing the raw, continuous audio recordings. These will eventually be published via the Arctic Data Center (arcticdata.io). If you'd like, feel free to send me an email (address should be in my profile), and I can ping you when that happens.
How can I actually search for audio? I'll check in 6 months or so.
What's the licensing? Is it public domain?
Very interesting. All the best for your thesis. Mine is not nearly as interesting.
Thanks! Appreciate it. Your work looks very interesting too, especially in the distributed systems space. Cheers!