> Additionally, rare hallucinations in Voice Mode persist with this update, resulting in unintended sounds resembling ads, gibberish, or background music. We are actively investigating these issues and working toward a solution.
Would be cool to hear some samples of this. I remember there was some hallucinated background music during the meditation demo in the original reveal livestream but haven't seen much beyond that. Probably an artifact of training on podcasts to get natural intonation.
I use advanced voice a lot and have come across many weird bugs.
1) Every response would be normal except it would end with a “whoosh”, like one of those sound effects some mail clients play when a message is sent, and the model itself either couldn’t or wouldn’t acknowledge it.
2) The same, except with someone knocking on a door, like something someone would play on a soundboard.
3) The entire history in the conversation disappearing after several minutes of back and forth, leading to the model having no idea what I’m talking about and acting as if it’s a fresh conversation.
4) Advanced voice mode stuttering because it hears its own voice and thinks it’s me interrupting (on a brand new iPhone 16 Pro, medium-low built in speaker volume and built-in mic).
5) Really weird changes in pronunciation or randomly saying certain words high-pitched, or suddenly using a weird accent.
And all of this was prior to these most recent changes.
It also stutters and repeats sometimes, and claims "poor connection" even though I know the connection is near-ideal.
I may know why that first one happens! They’re not correctly padding the latent in their decoder (by default torch pads with zeros, they should pad with whatever their latent’s representation of silence is). You can hear the same effect in songs generated with our music model: https://sonauto.ai/
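If that padding theory is right, the fix is mechanical. Here's a minimal PyTorch sketch of the two behaviors; decoder, latents (shape B x T x D), and silence_latent are all hypothetical stand-ins, with silence_latent being whatever the encoder produces for a buffer of digital silence, not anything from OpenAI's or Sonauto's actual code:

```python
import torch
import torch.nn.functional as F

def decode_zero_padded(decoder, latents, pad=4):
    # Default behavior: F.pad fills with zeros. If the zero vector is not
    # "silence" in latent space, the decoder renders the padded tail as a
    # real sound, e.g. the "whoosh" at the end of a response.
    padded = F.pad(latents, (0, 0, pad, pad))   # (B, T + 2*pad, D), zero-filled
    return decoder(padded)

def decode_silence_padded(decoder, latents, silence_latent, pad=4):
    # Proposed fix: pad with the latent representation of silence
    # (silence_latent assumed to have shape (1, 1, D)), so the boundary
    # frames decode to actual silence instead of an artifact.
    sil = silence_latent.expand(latents.shape[0], pad, -1)
    padded = torch.cat([sil, latents, sil], dim=1)
    return decoder(padded)
```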
Yeah we’re too lazy to fix it too
> 4) Advanced voice mode stuttering because it hears its own voice and thinks it’s me interrupting
I experience the same issue on an iPhone 15 Pro Max and have to mute the mic whenever I'm listening to a response. I wish they added an option to disable voice interruptions so that it could be interrupted only by touch.
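For what it's worth, the touch-only interruption being asked for is simple to express client-side. A rough sketch in Python; Mic, Player, and VoiceSession are made-up stubs, not any real OpenAI client API, and the point is only the gating logic: mute the mic while the assistant speaks so it can't hear its own output as a barge-in, while a tap still interrupts:

```python
class Mic:
    def mute(self):   print("[mic muted]")
    def unmute(self): print("[mic live]")

class Player:
    def play(self, audio): print(f"[playing {audio}]")
    def stop(self):        print("[playback stopped]")

class VoiceSession:
    def __init__(self, allow_voice_barge_in=False):
        self.mic = Mic()
        self.player = Player()
        # The wished-for setting: voice barge-in off by default.
        self.allow_voice_barge_in = allow_voice_barge_in

    def play_response(self, audio):
        if not self.allow_voice_barge_in:
            self.mic.mute()        # half-duplex: deaf while speaking
        self.player.play(audio)
        self.mic.unmute()          # listen again once playback ends

    def on_tap(self):
        self.player.stop()         # touch always interrupts
        self.mic.unmute()

session = VoiceSession()
session.play_response("response.wav")
session.on_tap()
```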
Oh man, I thought the “whoosh” was an intentional indicator that it was done speaking.
If anyone's wondering, here's a short sample. It quietly updated last night, and I ended up chatting for like an hour. It sounds as smart as before, but like 10x more emotionally intelligent. Laughter is the biggest giveaway, but the serious/empathetic tones for more therapy-like conversations are noticeable, too. https://drive.google.com/file/d/16kiJ2hQW3KF4IfwYaPHdNXC-rsU...
Did it really say "partwheel" or is it garbled?
Holy moley. Thanks for sharing. I had to work with the API version a lot the last week and it was frustrating how "old" it felt intelligence-wise. This is in another league, I hope it's 4.1 x audio training, I'd love to talk to this. Current one is passable for hands-free RAG, that's it for me.
they still need to post-train out the emissions of all the trapped souls
I have the feeling that the Advanced Voice Mode is significantly worse than when I used it earlier this week. The voice sounds disinterested, and has weird intonation. It used to be excellent for foreign language conversation practice, now significantly worse.
Edit: After using up my 15 minutes for testing, I have to say that the new voice is actually not bad, although I was used to something else. But it has a very clear "artificial" quality to it. It also sometimes misinterprets my input as something completely different than what I said, for example "please like my video and subscribe to my channel".
Stumbled across the new voice this afternoon after months of not using voice mode. After being impressed by the naturalness, I was let down by the disinterested tone. That, combined with the platitudes and the tendency to repeat back what I was saying without new information, left me disappointed with the update.
Is this new? I'm on the Plus plan and just a few days ago carried on a conversation for around 45 minutes while on a walk with my dog.
Agreed though, the new voice's accent (at least for Sol) sounds significantly degraded, particularly when conversing in Chinese.
Apparently it's 6 months old [1]. You might be using the standard voice mode (the advanced one has just 1 voice IIUC).
[1] https://www.reddit.com/r/OpenAI/comments/1hdamrm/so_advanced...
There’s a 15 minute limit?
In the Plus subscription, yes. You can also pay 200 dollars per month for Pro, and in that plan, advanced voice mode is unlimited. 200 bucks is quite a lot, I've gotta say. I wish there was a middle-ground option, but even for the 20 dollars for Plus, they should give you more than 15 minutes.
I wish they still had the voice mode that was _only_ text-to-speech and speech-to-text. It didn't sound as good, but it was as smart as the underlying model. The advanced voice mode regularly goes off the rails for me, makes the same mistake repeatedly, and does other things that the text versions of advanced LLMs haven't done for months now.
Don’t they? Press the microphone button for speech-to-text, and the speaker button for text-to-speech
In the app:
Settings > Personalization > Custom Instructions, then the Advanced dropdown. Uncheck Advanced Voice.
On the desktop site:
Profile button > Customize ChatGPT, then the Advanced dropdown. Uncheck Advanced Voice.
I echo your comments about advanced voice mode. It’s like a completely different, less “intelligent” model than the text mode ones. It’s like it has an incredibly short context window or something and really does a lousy job following your prompt.
As with all things LLM… everybody’s experience will be different. I’m sure there are plenty of people who manage to make it work.
They absolutely destroyed Sol. I'm not sure what it is now: the disinterest, the umms, the inability to speak directly to a question, a new inflection. But I am pretty mad. I am an avid voice user. I love to use advanced voice while I'm doing tasks, to explore new projects I want to work on, and to get a basic understanding of home renovation tasks, etc. I finally had to change the voice to Maple but ran out of time to see if I could stand it. So disappointing.
At least now I know I'm not crazy and there were in fact changes rolled out.
Yeah. I always used Sol but tonight before reading all this, my daughter and I were talking to it and even my 8-year old said it sounded like she didn’t care or want to talk to us. Super disappointing.
In my daily use, I just want the answer, not a performance. I'd rather it sound like a smart assistant, not my best friend.
This sort of tech is also useful in that situation since it can better understand and deliver vocal nuances (e.g., emphasis/tone that delivers meaning)
> Additionally, rare hallucinations in Voice Mode persist with this update, resulting in unintended sounds resembling ads, gibberish, or background music.
This would be really funny if it weren’t real life.
The women's voices all sound like the valley girl you wish wasn't invited to the party. The male voices sound, well, similar to that, I guess I'd say. I'd like voices that sound more like the ethnically diverse people found in the crowds that many of us interlope in, rather than the pompous Ivy League-educated girlfriend you wish your friend didn't have. The product shouldn't so clearly advertise that it was developed in a San Francisco monoculture.
you want more options for voices that reflect all the types of people in the world. good feedback.
The next part I'm only saying because it reminds me so much of my younger self: the rest of what you said, and how you said it, carries a lot of projection and insecurity.
The British Vale voice changed for the worse too. It used to be warm and friendly, but today there's much more uptalk and she sounds rather snarky, like she'd really rather not be talking to you at all. Not an improvement.
I keep using standard voice mode (Cove) because I like its grounded voice a lot. The advanced Cove’s voice sounds too much like an overly happy guy. I wish I could tell it to chill and talk normally but it won’t.
I was using it earlier today and noticed something was different. It sounded more lethargic and added a lot more "umms". It's not necessarily bad, just something I need to get used to.
I always get a laugh asking it to talk like an Ent, and I made sure to check that it could still do that.
If there's an OpenAI PM reading this: please add a model selector for voice mode. 80% of this thread is users confused about which model they're using.
i think there's only one llm backbone for voice. it's 4o.
Today I used ChatGPT and the voice was disgusting for the first time since I started using ChatGPT (months ago).
It was the voice of someone (a woman) who was confrontational, someone who does not like you.
It made me want to close and remove the chat immediately.
I don't suppose you have a bunch of custom instructions telling ChatGPT to be concise, terse, etc., do you? Those impact the voice model too, and it turns out the "get to the point, I'm not an idiot" pre-prompts people have been recommending really don't translate well when the voice mode uses them as a personality.
I don't like how it laughs while it speaks. I associate this behavior with the anxious, neurotic middle class. It's discomfiting.