← Back to context

Comment by ethagnawl

5 hours ago

What's been surprising in my experience regarding the wake word is that it recognizes me (adult male) saying the wake word ~95% of the time. However, it only registers the rest of my family (women and children) ~30% of the time.

I have no firsthand knowledge, but I’d strongly bet that the home-assistant effort to donate training data is mostly get adult males, and nearly zero children.

  • I remember when those systems first started collecting data they were worried kids wouldn't be handled - but they didn't know how to handle the privacy issuses with recording kids so discouraged it. Women being missed is not a surprise - but not anticipated.

  • This was 2021 (so pre-llm), but I used to work for a company that gathered data for training voice commands (Alexa, Toyota, Sonos, were some clients). Basically, we paid people to read digital assistant scripts at scale.

    Your assumptions about training data do not match the demographics of data I collected. The majority of what our work revolved around was getting diversity into the training data. We specifically recruited kids, older folks, women, people with accented/dialected English and just about every variety of speech that we could get our hands on. The companies we worked with were insanely methodical about ensuring that different people were included.

    • You are reporting on a deliberately curated effort vs. what I understand is effectively voluntary data donation without incentives. It's not surprising to me that the later dataset ends up biased due to the differences in sourcing.

  • Oh, I'm sure you're right. I've had people in my personal life (non-technical; "AI enthusiasts") laugh at me over concerns about training bias but this is likely a real world example of it.

    • I think you can train your own wake word with microWakeWord but I've never done it.