Comment by laserbeam

16 days ago

On the one hand, this is very convenient. Probably cool for some non-fiction.

On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.

I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.

> On the other, some of my favorite audio books all stood out because the narrator was interpreting the text really well

This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".

AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.

  • RIP to future top-enders that would normally have started out on the bottom to middle end.

    • > RIP to future top-enders that would normally have started out on the bottom to middle end.

      This stance always reminds me of Profession, a 1957 novella by Isaac Asimov that depicts pretty much this future, where there are only top performers and the ignorant crowd.

    • Virtually every book I want this for has been around for 70+ years and still no high or low quality audiobook has been produced. How long do I have to wait for those aspiring top-enders before an audiobook can be made available?

    • I'm super opposed to AI, but I see this as a rare positive. As someone already said, the win here is to have an audiobook where one doesn't yet exist. Hell, maybe the tables will turn and the scrubs will do the hard work of discovering which titles are popular with an audience, then the ebook industry can capitalize on AI by hiring voice actors to produce proper titles?

    • It's common for shows to use big name actors as voices because they draw an audience, nothing will change. Just means a smaller pool of voice actors and they'll mostly be good looking.

    • The value of distribution is increasing while the value of content and product is decreasing for all but the top end.

    • Not RIP at all. "Meritocracy" was coined in a book literally warning us about how terrible such a society would be: https://en.wikipedia.org/wiki/The_Rise_of_the_Meritocracy

      The "top-enders" are the privileged who need to have some of their gains for their intelligence redistributed to others. The alternative is "survival of the smartest", which is de-facto what we have today and what Young was trying to warn us about.

    • By that time, AI will beat the toppest of the top-enders. Remember the time Deep Blue barely beat Kasparov? Now no human, or group of humans, can beat a chess engine, even one that runs on an iPhone.

  • AI TTS has been available for quite some time. Tacotron V1 is about 8 years old. I don't think we saw much bottom end replacement.

    IMGO (gut opinion), generative AI is a consumption aid, like a strong antacid. It lets us be done with $content quicker, for content = {book, art, noisy_email, coding_task}. There are obvious preconceptions forming among us all from the "generative" nomenclature, but lots of the surviving usages are rather reductive, in relevant and useful ways.

I wholeheartedly agree. https://en.m.wikipedia.org/wiki/Stephen_Briggs got me hooked on Terry Pratchett's Discworld series. I loved "Going Postal".

  • I know someone who listened to Terry Pratchett's "Wachen! Wachen!" audiobook on Spotify while living in Germany for a few years. It was so well narrated that he also picked up some peculiarities of the local dialects used by specific characters in the book. Locals in Bavaria were quite surprised to hear a foreigner speaking that way.

Absolutely.

Even on the non-fiction side, the narration for Gleick's The Information adds something.

While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.

I agree with you, but also want to point out:

New authors, self-publishers, can't afford tens of thousands of dollars to get an audiobook recorded professionally... This can limit their distribution.

Authors might even choose not to make such a version (or lack the confidence to record themselves), so AI capable of making a decently passable version would be nice -- something more than reading text blandly. AI in theory could attempt to track the scene and adjust.

  • By observation, the current approach is for authors to narrate the book themselves if they think their readers will want it and if they feel reasonably confident in their own narration.

Yes, but if the alternative is not having a book, or having to listen to one poorly read (I love Librivox, but there are some books which I just haven't been able to finish because of readers, and many more which were nixed for family vacation travel listening on that account), this may be workable.

With this technology, one could produce high quality audio books without having access to high quality narrators by annotating the books with the voice, speed and such things.

I wonder if a standardized markup exists to do so.

  • There is SSML for speech markup to indicate various characteristics of speech like whispers, pronunciation, pace, emphasis, etc.

    With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.

    Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.

    If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?

    Amazon with its huge collection of Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "whispersync" which is a feature that syncs text in a kindle ebook with words in corresponding audible audiobook.
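As a rough illustration of what annotating a book "with the voice, speed and such things" could look like in SSML, here is a minimal Python sketch. It only uses elements defined in the W3C SSML 1.1 specification (`speak`, `voice`, `prosody`, `break`). The voice names `narrator` and `character_a`, the segment dictionary layout, and the helper names are assumptions made up for this example; real TTS engines each support their own voice names and their own subset of SSML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def segment_to_ssml(text, voice, rate="medium", pause_ms=0):
    """Wrap one passage in <voice> and <prosody>, optionally followed by a pause."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return (
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        f"</voice>{pause}"
    )


def book_to_ssml(segments, lang="en-US"):
    """Join annotated segments into a single SSML document string."""
    body = "".join(segment_to_ssml(**segment) for segment in segments)
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"{body}</speak>"
    )


# Hypothetical annotations: a slow narrator line, then a hurried character line.
example_ssml = book_to_ssml([
    {"text": "The door creaked & opened.", "voice": "narrator",
     "rate": "slow", "pause_ms": 300},
    {"text": "Who's there?", "voice": "character_a", "rate": "fast"},
])

# Sanity check: raises xml.etree.ElementTree.ParseError if the markup is malformed.
ET.fromstring(example_ssml)
print(example_ssml)
```

The hard part discussed in this thread, deciding which voice and pace each passage should get, is not solved here; the sketch only shows the target format such annotations would be rendered into.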

    • Good points, thank you! I just tested it. While ChatGPT was very good at adding generic (textual) annotations, the results for generating SSML were very poor (lack of voice names, lack of distinction between narrator and character, etc).

      A model trained for this, plus a human audit, could probably produce very good results.

  • They still wouldn't be high quality. It's just not possible to capture the precise tone of voice in an annotation, and that precision I believe really makes a difference. My experience is that the deeper the narrator understands the text and conveys that understanding, the easier it becomes for me to absorb that information.

    • Have you tried those "podcast from a paper" models? They do some of the things you are saying they don't. Although it's not 100%, it's also miles ahead of, for example, human Polish TV lectors or other monotone-style narrations.

  • Don't end-to-end trained models already do this to some extent? Like raising the pitch towards a question mark, like a human would.

    TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html

    • That's a bit basic and random. Some models have the features you describe. From the better models you get a slightly different voice for text in quotes.

      But the difference from good audiobooks is that those have (a) different voices for the narrator and each character and (b) different emotions and/or speed in certain situations.

      I guess you could use an LLM to "understand" and annotate an existing book if there's a markup, and then use TTS to create an audiobook from it and so automate most of the process.
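To make the "annotate, then TTS" idea concrete, here is a minimal Python sketch of just the first step: splitting a paragraph into narrator and dialogue segments. It is a deliberately crude stand-in, a regular expression over straight double quotes, not an LLM and not real understanding of the text. The function name and the `narrator`/`dialogue` labels are made up for this example, and it assumes quotes are balanced and never nested.

```python
import re

# Matches text between straight double quotes (assumption: balanced, not nested).
QUOTE_RE = re.compile(r'"([^"]+)"')


def split_narration_and_dialogue(paragraph):
    """Return a list of (role, text) tuples in reading order."""
    segments = []
    cursor = 0
    for match in QUOTE_RE.finditer(paragraph):
        # Everything between the previous quote and this one is narration.
        narration = paragraph[cursor:match.start()].strip()
        if narration:
            segments.append(("narrator", narration))
        segments.append(("dialogue", match.group(1).strip()))
        cursor = match.end()
    # Any trailing text after the last quote is narration as well.
    tail = paragraph[cursor:].strip()
    if tail:
        segments.append(("narrator", tail))
    return segments


for role, text in split_narration_and_dialogue(
    'She paused. "Who is there?" he asked, then added, "Come in."'
):
    print(f"{role}: {text}")
```

Each segment could then be handed to a TTS engine with a different voice. Working out which character is speaking, and with what emotion or pace, is the part that would still need an LLM or a human.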

I guess this is still very useful if you are blind.

  • Yeah, for accessibility purposes on things that aren't already narrated, this kind of thing is huge.

    • that's the thing. it's not just for accessibility. anything not already narrated is a fair target for TTS. i don't have time to sit down and read books. all reading is done on the go, while getting around or doing daily routines at home. i have a small book that i am reading now, which should take a few hours to finish, but in the time i manage to get done reading it i will probably have listened to two or three audio books.

      oh, and it's also a boon for those who can't afford to buy audiobooks.

    • I was just thinking about automatically slapping an mp3 on every blog post, just an accessibility nicety.

      Can someone with low vision tell me if this would be useful to them? It may be that specialist tools already do this better.

Agree with you on this.

My example: I was never a Wheel of Time fan, but the new audio editions done by Rosamund Pike are quite the performance, and make me like the story. She brings all the characters to life in a way that's different from just reading. It's a true performance.

On the other hand, there are a lot of narrators who are just bad, and the publisher is not going to pay for an alternate narration. These tools are a good way to re-narrate Wil Wheaton narrated books with correct pronunciation and inflection, for example.

Computer chess took a long time to get better than the best players in the world, but it was better than most chess players for many years before that. We're seeing that a lot with these generative models.

I guess using different narrators is essential for both fiction and non-fiction books if you want the full experience. Personally, I love it when audiobooks have narrators who stick to the characters’ personalities—it just feels right. Some of the audiobooks I’ve listened to have narrators who switch up their voices for each character, and others even use a different narrator for every character, which gets really good. Narration Box has been doing a really great job with this lately

A couple of my favorite audiobooks are Stranger in a Strange Land and Flowers for Algernon, where the performer changes the intonation and enunciation of the main character along with the character’s journey. It was a revelation and made me appreciate the stories in a way I did not get from reading the printed books the first time. Perhaps just the consistency of the performance is sometimes difficult to achieve in my imagination.

A GenAI model that reads audiobooks with such dramatisation is really my dream. There are so many books that I would want to listen to, but that still lack such an adaptation. Also, it takes months after the book release before the audiobook gets released.

Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.

I agree but the opposite can be true too. Sometimes the narrator seems to target some general audience that doesn’t fit me at all, in a way that makes me cringe when I listen, until I stop listening altogether. In these cases I’d rather listen to a relatively flat narration from a tool like this.

Would a "better" AI do a "better" narration with a better understanding of the text? Of course, that would imply a different (and far bigger?) model.

Anyway, even if in theory it might, in practice things may end up even worse than doing it with a monotone voice.

I like one speaker in one particular book.

He also narrates another scifi book series and honestly I dislike this a lot.

He became the voice of one particular character for me.

I would love variety