Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model

6 months ago (github.com)

Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. We are excited to launch a preview of our smallest model, which is less than 25 MB. This model has 15M parameters.

This release supports English text-to-speech applications in eight voices: four male and four female. The model is quantized to int8 + fp16, and it uses ONNX for runtime. The model is designed to run literally anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required!

We're releasing this to give early users a sense of the latency and voices that will be available in our next release (hopefully next week). We'd love your feedback! Just FYI, this model is an early checkpoint trained on less than 10% of our total data.

We started working on this because existing expressive OSS models require big GPUs to run them on-device and the cloud alternatives are too expensive for high-frequency use. We think there's a need for frontier open-source models that are tiny enough to run on edge devices!

382 comments

divamgupta

mlboss 6 months ago

Reddit post with generated audio sample: https://www.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...

seligman99 6 months ago
And a quick video with all of the different voices:
https://www.youtube.com/watch?v=60Dy3zKBGQg
- a96 6 months ago
  
  Thanks. I really would not want to listen to any of these regularly.
- tracker1 6 months ago
  
  Cool, thanks... aside: the last male voice sounds high/drunk.
- Eduard 6 months ago
  
  thank you!
smusamashah 6 months ago
The reddit video is awesome. I don't understand how people are calling it an OK model. Under 25MB and cpu only for this quality is amazing.
- soasme 6 months ago
  
  Just made a TTS tool based on Kitten TTS, fully browser based, no Python server backend: https://quickeditvideo.com/tts/ A tts model of this size should be industry standard!
- Retr0id 6 months ago
  
  The people calling it "OK" probably tried it for themselves. Whatever model is being demoed in that video is not the same as the 25MB model they released.
  
- sergiotapia 6 months ago
  
  https://vocaroo.com/1njz1UwwVHCF
  It doesn't sound so good. Excellent technical achievement and it may just improve more and more! But for now I can't use it for consumer facing applications.
  
- Mackena 6 months ago
  
  [flagged]
Zardoz84 6 months ago

Sounds very clear. For a non-native English speaker like me, it's easy to understand.
tapper 6 months ago
Sounds slow and like something from an anime
- ricardobeat 6 months ago
  
  Speech speed is always a tunable parameter and not something intrinsic to the model.
  The comparison to make is expressiveness and correct intonation for long sentences vs something like espeak. It actually sounds amazing for the size. The closest thing is probably KokoroTTS at 82M params and ~300MB.
  
- numpad0 6 months ago
  
  The only real questions are which Chinese gacha game they ripped data from and whether they used Claude Code or Gemini CLI for Python code. I bet one can get a formant match from output this much overfit to whatever data. This isn't going to stay up for long.
KaiserPro 6 months ago
was it cross trained on futurama voices?
- junon 6 months ago
  
  That would be a feature!
- archon810 6 months ago
  
  Sounds like Mort from Family Guy.
  
- divamgupta 6 months ago
  
  It was not
Aachen 6 months ago
Impressive technical achievement, but in terms of whether I'd use it: oof, that male voice is like one of these fake-excited newsreaders. Like they're always at the edge of their breath. The female one is better but still someone reading out an advertisement for a product they were told they must act extra excited for. I assume this is what the majority of training data was like and not an intentional setting for the demo. Unsure whether I could get used to that
I use TTS on my phone regularly and recently also tried this new project on F-Droid called SherpaTTS, which grabs some models from Huggingface. They're super heavy (the phone suspends other apps to disk while this runs) and sound good, but in the first news article there were already one or two mispronunciations because it's guessing how to say uncommon or new words and it's not based on logical rules anymore to turn text into speech
Google and Samsung each have a TTS engine pre-installed on my device and those sound and work fine. A tad monotonous but it seems to always pronounce things the same way so you can always work out what the text said
Espeak (or -ng) is the absolute worst, but after 30 seconds of listening closely you get used to it and can understand everything fine. I don't know if it's the best open source option (probably there are others that I should be trying) but it's at least the most reliable where you'll always get what is happening and you can install it on any device without licensing issues
- willwade 6 months ago
  
  If anyone else wants to try sherpaOnnx, you can try this: https://github.com/willwade/tts-wrapper We recently added the kokoro models, which should sound a lot better. There are a LOT of models to choose from. I have a feeling the Droid app isn't handling cold starts very well.
  
- divamgupta 6 months ago
  
  Thanks a lot for the detailed feedback. We are working on some models which do not use a phonemizer
- bornfreddy 6 months ago
  
  RHvoice is pretty good, imho.

nine_k 6 months ago

I hope this is the future. Offline, small ML models, running inference on ubiquitous, inexpensive hardware. Models that are easy to integrate into other things, into devices and apps, and even to drive from other models maybe.

WhyNotHugo 6 months ago
Dedicated single-purpose hardware with models would be even less energy-intensive. It's theoretically possible to design chips which run neural networks and the like using just resistors (rather than transistors).
Such hardware is not general-purpose, and upgrading the model would not be possible, but there's plenty of use-cases where this is reasonable.
- amelius 6 months ago
  
  But resistors are, even in theory, heat dissipating devices. Unlike transistors, which can in theory be perfectly on or off (in both cases not dissipating heat).
- regularfry 6 months ago
  
  It's theoretically possible but physical "neurons" is a terrible idea. The number of connections between two layers of an FF net is the product of the number of neurons in each, so routing makes every other problem a rounding error.
- divamgupta 6 months ago
  
  The thing is that the new models keep coming every day. So it’s economically not feasible to make chips for a single model
theshrike79 6 months ago
This is what Apple is envisioning with their SLMs, like having a model specifically for managing calendar events. It doesn't need to have the full knowledge of all humanity in it - just what it needs to manage the calendar.
- koolala 6 months ago
  
  Issue is they're envisioning everyone only using Apple products.
  
- throwaway28733 6 months ago
  
  Apple's hardware is notoriously overpriced, so I don't think they're envisioning that at all.
  
discardedrefuse 6 months ago
Hmm. A pay once (or not at all) model that can run on anything? Or a subscription model that locks you in, and requires hardware that only the richest megacorps can afford? I wonder which one will win out.
- tracker1 6 months ago
  
  The popular one.
divamgupta 6 months ago

That is our vision too!
divamgupta 6 months ago

This is our goal too.
rohan_joshi 6 months ago

yeah totally. the quality of these tiny models is only going to go up.

peanut_merchant 6 months ago

I ran some quick benchmarks.

Ubuntu 24, Razer Blade 16, Intel Core i9-14900HX

  Performance Results:

  Initial Latency: ~315ms for short text

  Audio Generation Speed (seconds of audio per second of processing):
  - Short text (12 chars): 3.35x realtime
  - Medium text (100 chars): 5.34x realtime
  - Long text (225 chars): 5.46x realtime
  - Very Long text (306 chars): 5.50x realtime

  Findings:
  - Model loads in ~710ms
  - Generates audio at ~5x realtime speed (excluding initial latency)
  - Performance is consistent across different voices (4.63x - 5.28x realtime)
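
For anyone who wants to reproduce numbers like these, here is a minimal sketch of how an "x realtime" figure can be measured. It is not the script used above: `synthesize` is a hypothetical callable standing in for whatever TTS call you use (text in, sequence of audio samples out), and the sample rate is something you have to supply for your own setup.

```python
import time


def realtime_factor(audio_seconds, processing_seconds):
    """Seconds of audio produced per second of processing (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def benchmark(synthesize, text, sample_rate):
    """Time one call to `synthesize` and return its realtime factor.

    `synthesize` is a hypothetical stand-in: it takes text and returns
    a sequence of audio samples at `sample_rate` samples per second.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return realtime_factor(audio_seconds, elapsed)
```

Run it a few times per text length and discard the first call if you want to exclude model loading and warm-up from the figure.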

divamgupta 6 months ago

Thanks for running the benchmarks. Currently the models are not optimized yet. We will optimize loading etc when we release an SDK meant for production :)
don-bright 6 months ago
on my Intel(R) Celeron(R) N4020 CPU @ 1.10GHz it takes 6 seconds to import/load and text generation is roughly 1x realtime on various lengths of text.
- Jotalea 6 months ago
  
  thanks for testing on the same hardware as mine, before me.

blopker 6 months ago

Web version: https://clowerweb.github.io/kitten-tts-web-demo/

It sounds ok, but impressive for the size.

nine_k 6 months ago
Does anybody find it funny that sci-fi movies have to heavily distort "robot voices" to make them sound "convincingly robotic"? A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations. I don't expect a smart toaster to talk like a BBC host; it'd be enough if the speech is easy to recognize.
- userbinator 6 months ago
  
  > A robotic, explicitly non-natural voice would be perfectly acceptable, and even desirable, in many situations[...]it'd be enough if the speech is easy to recognize.
  We've had formant synths for several decades, and they're perfectly understandable and require a tiny amount of computing power, but people tend not to want to listen to them:
  https://en.wikipedia.org/wiki/Software_Automatic_Mouth
  https://simulationcorner.net/index.php?page=sam (try it yourself to hear what it sounds like)
  
- roywiggins 6 months ago
  
  This one is at least an interesting idea: https://genderlessvoice.com/
  
- mfro 6 months ago
  
  In the Culture novels, Iain Banks imagines that we would become uncomfortable with the uncanny realism of transmitted voices / holograms, and intentionally include some level of distortion to indicate you're speaking to an image
- incone123 6 months ago
  
  Depends on the movie. Ash and Bishop in the Alien franchise sound human until there's a dramatic reason to sound more 'robotic'.
  I agree with your wider point. I use Google TTS with Moon+Reader all the time (I tried audio books read by real humans but I prefer the consistency of TTS)
  
- Twirrim 6 months ago
  
  > I don't expect a smart toaster to talk like a BBC host;
  Well sure, the BBC have already established that it's supposed to sound like a brit doing an impersonation of an American: https://www.youtube.com/watch?v=LRq_SAuQDec
- looperhacks 6 months ago
  
  I remember that the novelization of the fifth element describes that the cops are taught to speak as robotic as possible when using speakers for some reason. Always found the idea weird that someone would _want_ that
- addandsubtract 6 months ago
  
  If you're on a Mac, you can type "say [thing to say]" into your terminal.
- msgodel 6 months ago
  
  I personally prefer the older synthetic voices for TTS when the text is coming from software or a language model.
bkyan 6 months ago
I got an error when I tried the demo with 6 sentences, but it worked great when I reduced the text to 3 sentences. Is the length limit due to the model or just a limitation for the demo?
- divamgupta 6 months ago
  
  Currently we don't have chunking enabled yet. We will add it soon. That will remove the length limitations.
- cess11 6 months ago
  
  Perhaps a length limit? I tried this:
  "This first Book proposes, first in brief, the whole Subject, Mans disobedience, and the loss thereupon of Paradise wherein he was plac't: Then touches the prime cause of his fall, the Serpent, or rather Satan in the Serpent; who revolting from God, and drawing to his side many Legions of Angels, was by the command of God driven out of Heaven with all his Crew into the great Deep."
  It takes a while until it starts generating sound on my i7 cores but it kind of works.
  This also works:
  "blah. bleh. blih. bloh. blyh. bluh."
  So I don't think it's a limit on punctuation. Voice quality is quite bad though, not as far from the old school C64 SAM (https://discordier.github.io/sam/) of the eighties as I expected.
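
Until chunking lands upstream, a workaround along the lines divamgupta describes can be sketched as follows. This is an assumption about how one might do it, not the project's implementation: `chunk_text` and the `max_chars` value are made up for illustration, and the right limit for the model is not stated anywhere in this thread.

```python
import re


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that case.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting audio concatenated. Splitting only at sentence boundaries keeps the seams at natural pauses.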
Retr0id 6 months ago
I tried to replicate their demo text but it doesn't sound as good for some reason.
If anyone else wants to try:
> Kitten TTS is an open-source series of tiny and expressive text-to-speech models for on-device applications. Our smallest model is less than 25 megabytes.
- cortesoft 6 months ago
  
  Is the demo using the not smallest model?
  
quantummagic 6 months ago
Doesn't work here. Backend module returns 404 :
https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
- Retr0id 6 months ago
  
  Looks like this commit 15 minutes ago broke it https://github.com/clowerweb/kitten-tts-web-demo/commit/6b5c...
  (seems reverted now)
itake 6 months ago
> Error generating speech: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Non-zero status code returned while running Expand node. Name:'/bert/Expand' Status Message: invalid expand shape
Doesn't seem to work with Thai.
- jainilprajapati 6 months ago
  
  You can also try on https://clowerweb.github.io/node_modules/onnxruntime-web/dis...
nxnsxnbx 6 months ago
Thanks, I was looking for that. While the reddit demo sounds ok, even though on a level we reached a couple of years ago, all TTS samples I tried were barely understandable at all
- divamgupta 6 months ago
  
  This is just an early checkpoint. We hope that the quality will improve in the future.
Aardwolf 6 months ago
On PC it's a python dependency hell but someone managed to package it in self contained JS code that works offline once it loaded the model? How is that done?
- a2128 6 months ago
  
  ONNXRuntime makes it fairly easy, you just need to provide a path to the ONNX file, give it inputs in the correct format, and use the outputs. The ONNXRuntime library handles the rest. You can see this in the main.js file: https://github.com/clowerweb/kitten-tts-web-demo/blob/main/m...
  Plus, Python software is dependency hell in general, while webpages have to be self-contained by their nature (thank god we no longer have Silverlight and Java applets...)
scotty79 6 months ago

It feels like it doesn't handle punctuation well. I don't hear sentence boundaries and commas. It sounds like a continuous stream of words.
rohan_joshi 6 months ago

yeah, this is just a preview model from an early checkpoint. the full model release will be next week which includes a 15M model and an 80M model, both of which will have much higher quality than this preview.
rldjbpin 6 months ago

besides issues with webgpu (it is in beta fwiw), it'd be nice to increase voice speed through the setting without affecting the voice pitch.
Jotalea 6 months ago

Using male voice 2 at 48kHz at 0.5x speed sounds a lot like Madeline's voice lines in Celeste. Seemed funny to me.
belchiorb 6 months ago
This doesn’t seem to work on Safari. Works great on Chrome, though
- divamgupta 6 months ago
  
  Hmm, we will look into it.
  
kenarsa 6 months ago
[flagged]
- gary_0 6 months ago
  
  Not open source. "You will need internet connectivity to validate your AccessKey with Picovoice license servers ... If you wish to increase your limits, you can purchase a subscription plan." https://github.com/Picovoice/orca#accesskey
  
- satvikpendem 6 months ago
  
  Does an apk for Android exist for replacing its speech to text engine? I tried sherpa-onnx but it was too slow for real time usage it seemed, and especially so for audiobooks when sped up.
  

MutedEstate45 6 months ago

The headline feature isn’t the 25 MB footprint alone. It’s that KittenTTS is Apache-2.0. That combo means you can embed a fully offline voice in Pi Zero-class hardware or even battery-powered toys without worrying about GPUs, cloud calls, or restrictive licenses. In one stroke it turns voice everywhere from a hardware/licensing problem into a packaging problem. Quality tweaks can come later; unlocking that deployment tier is the real game-changer.

rohan_joshi 6 months ago

yeah, we are super excited to build tiny ai models that are super high quality. local voice interfaces are inevitable and we want to power those in the future. btw, this model is just a preview, and the full release next week will be of much higher quality, along w another ~80M model ;)
woadwarrior01 6 months ago
> It’s that KittenTTS is Apache-2.0
Have you seen the code[1] in the repo? It uses phonemizer[2] which is GPL-3.0 licensed. In its current state, it's effectively GPL licensed.
[1]: https://github.com/KittenML/KittenTTS/blob/main/kittentts/on...
[2]: https://github.com/bootphon/phonemizer
Edit: It looks like I replied to an LLM generated comment.
- oezi 6 months ago
  
  The issue is even bigger: phonemizer is using espeak-ng, which isn't very good at turning graphemes into phonemes. In other TTS which rely on phonemes (e.g. Zonos) it turned out to be one of the key issues which cause bad generations.
  And it isn't something you can fix, because the model was trained on bad phonemes (everyone uses Whisper + then phonemizes the text transcript).
- jacereda 6 months ago
  
  https://github.com/KittenML/KittenTTS/issues/17
  
- gorgoiler 6 months ago
  
  This would only apply if they were distributing the GPL licensed code alongside their own code.
  If my MIT-licensed one-line Python library has this line of code…
  run(["bash", "-c", "echo hello"])
  …I’m not suddenly subject to bash’s licensing. For anyone wanting to run my stuff though, they’re going to need to make sure they themselves have bash installed.
  (But, to argue against my own point, if an OS vendor ships my library alongside a copy of bash, do they have to now relicense my library as GPL?)
  
- Hackbraten 6 months ago
  
  Given that the FSF considers Apache-2.0 to be compatible with GPL-3.0 [0], how could the fact that phonemizer is GPL-3.0 possibly be an issue?
  [0]: https://www.gnu.org/licenses/license-list.html#apache2
  
- keyKeeper 6 months ago
  
  Okay, what's stopping you from feeding the code into an LLM, having it rewrite the code, and making it yours? You can even add extra steps, like making it analyze the code block by block and then supervising it as it rewrites it. Bam. AI age IP freedom.
  Morals may stop you but other than that? IMHO all open source code is public domain code if anyone is willing to spend some AI tokens.
  
defanor 6 months ago
A Festival's English model, festvox-kallpc16k, is about 6 MB, and it is a large model; festvox-kallpc8k is about 3.5 MB.
eSpeak NG's data files take about 12 MB (multi-lingual).
I guess this one may generate more natural-sounding speech, but older or lower-end computers were capable of decent speech synthesis previously as well.
- Joel_Mckay 6 months ago
  
  Custom voices could be added, but the speed was more important to some users.
  $ ls -lh /usr/bin/flite
  Listed as 27K last I checked.
  I recall some Blind users were able to decode Gordon 8-bit dialogue at speeds most people found incomprehensible. =3
  
pjc50 6 months ago

> KittenTTS is Apache-2.0
What about the training data? Is everyone 100% confident that models are not a derived work of the training inputs now, even if they can reproduce input exactly?
entropie 6 months ago

I'm playing around with an NVIDIA Jetson Orin Nano Super right now and it's actually pretty usable with gemma3:4b and quite fast - even image processing is done in like 10-20 seconds, but this is with GPU support. When something is not working and ollama is not using the GPU, these calls take ages because the CPU is just bad.
I am curious how fast this is with CPU only.
phh 6 months ago

It depends on espeak-ng which is GPLv3
ethan_smith 6 months ago

This opens up voice interfaces for medical devices, offline language learning tools, and accessibility gadgets for the visually impaired - all markets where cloud dependency and proprietary licenses were showstoppers.
Narishma 6 months ago
But Pi Zero has a GPU, so why not make use of it?
- a96 6 months ago
  
  Because then you're stuck on that device only.
CyberDildonics 6 months ago

The github just has a few KB of python that looks like an install script. How is this used from C++ ?

antisol 6 months ago

  System Requirements
  Works literally everywhere

Haha, on one of my machines my python version is too old, and the package/dependencies don't want to install.

On another machine the python version is too new, and the package/dependencies don't want to install.

akx 6 months ago
I opened a couple of PRs to fix this situation:
https://github.com/KittenML/KittenTTS/pull/21 https://github.com/KittenML/KittenTTS/pull/24 https://github.com/KittenML/KittenTTS/pull/25
If you have `uv` installed, you can try my merged ref that has all of these PRs (and #22, a fix for short generation being trimmed unnecessarily) with
uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --output output.wav --text "This high quality TTS model works without a GPU"
- tetris11 6 months ago
  
  Thanks for the quick intro into UV, it looks like docker layers for python
  I found the TTS a bit slow so I piped the output into ffplay with 1.2x speedup to make it sound a bit better
  uvx --from git+https://github.com/akx/KittenTTS.git@pr-21-22-24-25 kittentts --text "I serve 12 different beers at my restaurant for over 1000000 customers" --voice expr-voice-3-m --output - | ffplay -af "atempo=1.2" -f wav -
  
VagabundoP 6 months ago

Install it with uvx that should solve the python issues.
https://docs.astral.sh/uv/guides/tools/
uv installation:
https://docs.astral.sh/uv/getting-started/installation/
IshKebab 6 months ago

Yeah some people have a problem and think "I'll use Python". Now they have like fifty problems.
77pt77 6 months ago

I had the too new.
This package is the epitome of dependency hell.
Seriously, stick with piper-tts.
Easy to install, 50MB gives you excellent results and 100MB gives you good results with hundreds of voices.
xena 6 months ago
It doesn't work on Fedora because g++ isn't the right version.
- trostaft 6 months ago
  
  Not sure if they've fixed between then and now, but I just had it working locally on Fedora.
  > g++ --version
  g++ (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
  Copyright (C) 2025 Free Software Foundation, Inc.
divamgupta 6 months ago
We are working to fix that. Thanks
- pjc50 6 months ago
  
  "Fixing python packaging" is somewhat harder than AGI.
  
- raybb 6 months ago
  
  Have you considered offering a uvx command to run to get people going quickly?
  
- flanked-evergl 6 months ago
  
  Just point people to uv/uvx.
  
hahn-kev 6 months ago
Python man
- baobun 6 months ago
  
  man python
  There you go.
  
turnsout 6 months ago
You're getting a lot of comments along the lines of "Why don't you just ____," which only shows how Stockholmed the entire Python community is.
With no other language are you expected to maintain several entirely different versions of the language, each of which is a relatively large installation. Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?
I'm going to get downvoted to oblivion, but it doesn't change the reality that Python in 2025 is unnecessarily fragile.
- jhurliman 6 months ago
  
  That’s exactly what I have. The C++ codebases I work on build against a specific pinned version of LLVM with many warnings (as errors) enabled, and building with a different version entails a nonzero amount of effort. Ubuntu will happily install several versions of LLVM side by side or compilation can be done in a Docker container with the correct compiler. Similarly, the TypeScript codebases I work with test against specific versions of node.js in CI and the engine field in package.json is specified. The different versions are managed via nvm. Python is the same via uv and pyproject.toml.
  
- debugnik 6 months ago
  
  I agree with your point, but
  > if we all had five different llvms or gccs
  Oof, those are poor examples. Most compilers using LLVM other than clang do ship with their own LLVM patches, and cross-compiling with GCC does require installing a toolchain for each target.
  
- 77pt77 6 months ago
  
  > Can you imagine if we all had five different llvms or gccs just to compile five different modern C projects?
  Yes, because all I have to do is look at the real world.
sigmoid10 6 months ago
There are still people who use machine wide python installs instead of environments? Python dependency hell was already bad years ago, but today it's completely impractical to do it this way. Even on raspberries.
- superkuh 6 months ago
  
  Yep. Python stopped being Python a decade ago. Now there are just innumerable Pythons. Perl... on the other hand, you can still run any perl script from any time on any system perl interpreter and it works! Granted, perl is unpopular and not getting constant new features re: hardcore math/computation libs.
  Anyway, I think I'll stick with Festival 1.96 for TTS. It's super fast even on my core2duo and I have exactly zero chance of getting this Python 3'ish script to run on any machine with an OS older than a handful of years.
  
- lynx97 6 months ago
  
  Debian pretty much "solved" this by making pip refuse to install packages if you are not in a venv.
  
- yjftsjthsd-h 6 months ago
  
  Using venv won't save you from having the wrong version of the actual Python interpreter installed.
  
dzogchen 6 months ago
Such an ignorant thing to say for something that requires 25MB RAM.
- Bilal_io 6 months ago
  
  Not sure what the size has to do with anything.
  I send you a 500kb Windows .exe file and claim it runs literally everywhere.
  Would it be ignorant to say anything against it because of its size?
  
- dlcarrier 6 months ago
  
  It reminds me of the costs and benefits of RollerCoaster Tycoon being written in assembly language. Because it was so light on resources, it could run on any privately owned computer, or at least anything x86, which was pretty much everything at the time.
  Now, RISC architectures are much more common, so instead of the rare 68K Apple/Amiga/etc computer that existed at the time, it's super common to want to run software on an ARM or occasionally RISC-V processor, so writing in x86 assembly language would require emulation, making for worse performance than a compiled language.
exe34 6 months ago

system python is for system applications that are known to work together. If you need a python install for something else, there's venv or conda and then pip install stuff.
Tatiana343 6 months ago

[flagged]
miellaby 6 months ago

You're supposed to use venv for everything but the python scripts distributed with your os

klipklop 6 months ago

I tried it. Not bad for the size (of the model) and speed. Once you install the massive number of libraries and other things needed, we are a far cry from 25MB though. Cool project nonetheless.

devnen 6 months ago
That's a great point about the dependencies.
To make the setup easier and add a few features people are asking for here (like GPU support and long text handling), I built a self-hosted server for this model: https://github.com/devnen/Kitten-TTS-Server
The goal was a setup that "just works" using a standard Python virtual environment to avoid dependency conflicts.
The setup is just the standard git clone, pip install in a venv, and python server.py.
- k4rnaj1k 6 months ago
  
  Oh wow, really impressive. How long did this take you to make?
  
Dayshine 6 months ago
It mentions ONNX, so I imagine an ONNX model is or will be available.
ONNX runtime is a single library, with C#'s package being ~115MB compressed.
Not tiny, but usually only a few lines to actually run and only a single dependency.
- wongarsu 6 months ago
  
  The repository already runs an ONNX model. But the ONNX model doesn't get English text as input, it gets tokenized phonemes. The preprocessing for that is where most of the dependencies come from.
  Which is completely reasonable imho, but obviously comes with tradeoffs.
  
- divamgupta 6 months ago
  
  We will try to get rid of dependencies.
WhyNotHugo 6 months ago
Usually pulling in lots of libraries helps develop/iterate faster. Then can be removed later once the whole thing starts to take shape.
- zelphirkalt 6 months ago
  
  This case might be different, but ... usually that "later" never happens.

keyle 6 months ago

I don't mind so much the size in MB, the fact that it's pure CPU and the quality, what I do mind however is the latency. I hope it's fast.

Aside: Are there any models for understanding voice to text, fully offline, without training?

I will be very impressed when we are able to have a conversation with an AI at a natural rate and not "probe, space, response"

Dayshine 6 months ago

Nvidia's parakeet https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 appears to be state of the art for English: 10x faster than Whisper.
My mid-range AMD CPU is multiple times faster than realtime with parakeet.
colechristensen 6 months ago
>Aside: Are there any models for understanding voice to text, fully offline, without training?
OpenAI's whisper is a few years old and pretty solid.
https://github.com/openai/whisper
- Hackbraten 6 months ago
  
  Whisper tends to fill silence with random garbage from its training set. [0] [1] [2]
  [0]: https://github.com/openai/whisper/discussions/679
  [1]: https://github.com/openai/whisper/discussions/928
  [2]: https://github.com/openai/whisper/discussions/2608
jiehong 6 months ago

Voice to text fully offline can be done with whisper. A few apps offer it for dictation or transcription.
blensor 6 months ago
"The brown fox jumps over the lazy dog.."
Average duration per generation: 1.28 seconds
Characters processed per second: 30.35
--
"Um"
Average duration per generation: 0.22 seconds
Characters processed per second: 9.23
--
"The brown fox jumps over the lazy dog.. The brown fox jumps over the lazy dog.."
Average duration per generation: 2.25 seconds
Characters processed per second: 35.04
--
processor : 0
vendor_id : AuthenticAMD
cpu family : 25
model : 80
model name : AMD Ryzen 7 5800H with Radeon Graphics
stepping : 0
microcode : 0xa50000c
cpu MHz : 1397.397
cache size : 512 KB
- moffkalast 6 months ago
  
  Hmm that actually seems extremely slow, Piper can crank out a sentence almost instantly on a Pi 4 which is like a sloth compared to that Ryzen and the speech quality seems about the same at first glance.
  I suppose it would make sense if you want to include it on top of an LLM that's already occupying most of a GPU and this could run in the limited VRAM that's left.
- keyle 6 months ago
  
  assuming most answers will be more than a sentence, 2.25 seconds is already long enough if you factor the token generation in between... and imagine with reasoning!... We're not there yet.
Teever 6 months ago
Any idea what factors play into latency in TTS models?
- divamgupta 6 months ago
  
  Mostly model size, and input size. Some models which use attention are O(N^2)
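
To make the O(N^2) remark concrete, here is a toy sketch of why attention cost grows with the square of the input length: every token is scored against every other token. The functions below are illustrative only (plain Python lists, no batching or softmax) and say nothing about how Kitten TTS itself is built.

```python
def attention_scores(queries, keys):
    """Naive attention scores: one dot product per (query, key) pair."""
    return [
        [sum(q * k for q, k in zip(query, key)) for key in keys]
        for query in queries
    ]


def score_count(n_tokens):
    """Number of pairwise scores self-attention computes for n_tokens inputs."""
    return n_tokens * n_tokens


# Doubling the input length quadruples the number of scores.
ratio = score_count(200) / score_count(100)
```

So a 200-token input needs four times the score computations of a 100-token one, which is one reason latency can climb faster than text length.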

sandreas 6 months ago

Cool.

While I think this is indeed impressive and has a specific use case (e.g. in the embedded sector), I'm not totally convinced that the quality is good enough to replace bigger models.

With fish-speech[1] and f5-tts[2] there are at least 2 open source models pushing the quality limits of offline text-to-speech. I tested F5-TTS with an old NVidia 1660 (6GB VRAM) and it worked ok-ish, so running it on a little more modern hardware will not cost you a fortune and produce MUCH higher quality with multi-language and zero-shot support.

For Android there is SherpaTTS[3], which plays pretty well with most TTS Applications.

1: https://github.com/fishaudio/fish-speech

2: https://github.com/SWivid/F5-TTS

3: https://github.com/woheller69/ttsengine

divamgupta 6 months ago

We have released just a preview of the model. We hope to make the model much better in future releases.
nickpsecurity 6 months ago

Fish Speech says its weights are for non-commercial use.
Also, what are the two's VRAM requirements? This model has 15 million parameters which might run on low-power, sub-$100 computers with up-to-date software. Your hardware was an out-of-date 6GB GPU.

wkat4242 6 months ago

Hmm the quality is not so impressive. I'm looking for a really naturally sounding model. Not very happy with piper/kokoro, XTTS was a bit complex to set up.

For STT whisper is really amazing. But I miss a good TTS. And I don't mind throwing GPU power at it. But anyway, this isn't it either, this sounds worse than kokoro.

echelon 6 months ago
> Hmm the quality is not so impressive. [...] And I don't mind throwing GPU power at it.
This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
Back in the pre-Tacotron2 days, I was running slim TTS and vocoder models like GlowTTS and MelGAN on Digital Ocean droplets. No GPU to speak of. It cost next to nothing to run.
Since then, the trend has been to scale up. We need more models to scale down.
In the future we'll see small models living on-device. Embedded within toys and tools that don't need or want a network connection. Deployed with Raspberry Pi.
Edge AI will be huge for robotics, toys and consumer products, and gaming (ie. world models).
- wkat4242 6 months ago
  
  > This isn't for you, then. You should evaluate quality here based on the fact you don't need a GPU.
  I know but it was more of a general comment. A really good TTS just isn't around yet in the OSS sphere. I looked at some of the other suggestions here but they have too many quirks. Dia sounds great but messages must have certain lengths etc and it picks a random voice every time. I'd love to have something self hosted that's as good as openai.
kamranjon 6 months ago
The best open one I've found so far is Dia - https://github.com/nari-labs/dia - it has some limitations, but i think it's really impressive and I can run it on my laptop.
- wkat4242 6 months ago
  
  Thanks I'll try! I like how it sounds, the quality is really good. But the limitations are really severe (shorter than 5 seconds is not ok, > 30 seconds is not ok, it will play a random voice every time, those make it pretty much unusable for an assistant to be honest).
  But it might be worth setting it up and seeing if it improves over time.
  
jainilprajapati 6 months ago
You should give https://pinokio.co/ a try
- wkat4242 6 months ago
  
  Thanks I'll try!
gnulinux 6 months ago
Imho chatterbox is the current open weight SOTA model in terms of quality: https://huggingface.co/ResembleAI/chatterbox
- wkat4242 6 months ago
  
  Thank you, I hadn't heard of it. Will have a look! The samples sound excellent indeed.
guskel 6 months ago

Chatterbox is also worth a try.
kenarsa 6 months ago
Try https://github.com/Picovoice/orca
- wkat4242 6 months ago
  
  Thanks!

dr_kiszonka 6 months ago

Microsoft's and some of Google's TTS models make the simplest mistakes. For instance, they sometimes read "i.e." as "for example." This is a problem if you have low vision and use TTS for, say, proofreading your emails.

Why does it happen? I'm genuinely curious.

lynx97 6 months ago
Well, speech synthesizers are pretty much famous for speaking all sorts of things wrong. But what I find very concerning about LLM based TTS is that some of them can't really speak numbers greater than 100. They try, but fail a lot. At least tts-1-hd was pretty much doing this for almost every 3 or 4 digit number. Especially noticeable when it is supposed to read a year number.
- wongarsu 6 months ago
  
  From the web demo this model is really good at numbers. It rushes through them, slurs them a bit together, but they are all correct, even 7 digit numbers (didn't test further).
  Looks like they are sidestepping these kinds of issues by generating the phonemes with the preprocessing stage of traditional speech synthesizers, and using the LLM only to turn those phonemes into natural-ish sounding speech. That limits how natural the model can become, but it should be able to correctly pronounce anything the preprocessing can pronounce
- jpc0 6 months ago
  
  Not entirely related but humans have the same problem.
  For scriptwriting when doing voice overs we always explicitly write out everything. So instead of 1 000 000 we would write one million or a million. This is a trivial example but if the number was 1 548 736 you will almost never be able to just read that off. However one million, five hundred and forty eight thousand, seven hundred and thirty six can just be read without parsing.
  Same with urls, W W W dot Google dot com.
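The kind of expansion described above can be sketched in a few lines. This is a hypothetical helper, not part of any TTS front end mentioned in this thread. It handles only non-negative integers below one trillion and uses the "hundred and" style from the example.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(1_000_000_000, "billion"), (1_000_000, "million"), (1_000, "thousand")]


def _below_thousand(n: int) -> str:
    """Spell out an integer in the range 1..999."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + (f" {ONES[ones]}" if ones else ""))
    return " and ".join(parts)


def number_to_words(n: int) -> str:
    """Spell out 0 <= n < 10**12 so it can be read aloud without parsing digits."""
    if n < 0 or n >= 10**12:
        raise ValueError("only integers in [0, 10**12) are supported")
    if n == 0:
        return "zero"
    groups = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            groups.append(f"{_below_thousand(count)} {name}")
    if n:
        groups.append(_below_thousand(n))
    return ", ".join(groups)


print(number_to_words(1_548_736))
# one million, five hundred and forty eight thousand, seven hundred and thirty six
```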
  
Retr0id 6 months ago

They're often trained from video subtitles, and humans writing subtitles make that kind of mistake too.
3rd3 6 months ago
You probably mean "e.g." as "for example", not "i.e."?
This might be on purpose and part of the training data because "for example" just sounds much better than "e.g.". Presumably for most purposes, linguistic naturalness is more important than fidelity.
- layer8 6 months ago
  
  Sometimes I use “for example” and “e.g.” in consecutive sentences to not sound repetitive, or possibly even within the same sentence (e.g. in parentheses). In that case, speaking both as “for example” would degrade it linguistically.
  In any case, I’d like TTS to not take that kind of artistic freedom.
- dr_kiszonka 6 months ago
  
  I did mean "i.e." That's why it is a problem : - \

GaggiX 6 months ago

https://huggingface.co/KittenML/kitten-tts-nano-0.1

https://github.com/KittenML/KittenTTS

This is the model and Github page, this blog post looks very much AI generated.

toisanji 6 months ago

Wow, amazing and good work, I hope to see more amazing models running on CPUs!

rohan_joshi 6 months ago

thanks, we're going to release many more models in the future, that can run on just CPUs.

OfflineSergio 6 months ago

amazing! can't wait to integrate it into https://desktop.with.audio I'm already using KokorosTTS without a GPU. It works fairly well on Apple Silicon.

Foundational tools like this open up the possibility of one-time payment or even free tools.

rohan_joshi 6 months ago

would love to see how that turns out. the full model release next week will be more expressive and higher quality than this one so we're excited to see you try that out.

pkaye 6 months ago

Where does the training data for the models come from? Is there an openly available dataset that people use?

spapas82 6 months ago

This is great for English, but is there something similar for other languages? Could this be trained somehow to support other languages?

onair4you 6 months ago

Okay, lots of detailed information and example code, great. But skimming through I didn’t see any audio samples to judge the quality?

TheAceOfHearts 6 months ago
They posted a demo on reddit[0]. It sounds amazing given the tiny size.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1mhyzp7/kitten_...
- onair4you 6 months ago
  
  Thanks! Yeah. It definitely isn’t the absolute best in quality but it trounces the default TTS options on macOS (as third party developers are locked out of the Siri voices). And for less than the size of many modern web pages…

dang 6 months ago

Most of these comments were originally posted to a different thread (https://news.ycombinator.com/item?id=44806543). I've moved them hither because on HN we always prefer to give the project creators credit for their work.

(it does however explain how many of these comments are older than the thread they are now children of)

ricardobeat 6 months ago

The samples featured elsewhere seem to be from a larger model?

After testing this locally, it still sounds quite mechanical, and fails catastrophically for simple phrases with numbers ("easy as 1-2-3"). If the 80M model can improve on this and keep the expressiveness seen in the reddit post, that looks promising.

nullc 6 months ago

Might be useful to split it up to use the same speech features as some off the shelf vocoder, such as FARGAN used by RADE (https://freedv.org/radio-autoencoder/) and DRED (https://jmvalin.ca/papers/valin_dred_journal.pdf).

maxloh 6 months ago

Hi. Will the training and fine-tuning code also be released?

It would be great if the training data were released too!

killerstorm 6 months ago

I'm curious why smallish TTS models have metallic voice quality.

The pronunciation sounds about right - i thought it's the hard part. And the model does it well. But voice timbre should be simpler to fix? Like, a simple FIR might improve it?

nickpsecurity 6 months ago

We change our tone based on personal style, emotion, context, and other factors. An accurate generator might need to encode all that information in the model. It will be larger than a model that doesn't do all of that.
codedokode 6 months ago

Probably "metallicity" is due to a lack of detail and cannot be fixed that easily.

rishav_sharan 6 months ago

Question for the experts here: what would be a SOTA TTS that can run on an average laptop (32GB RAM, 4GB VRAM)? I just want to attach a TTS to my SLM output, and get the highest possible voice quality and human likeness.

kroaton 6 months ago

Try Unmute by Kyutai - https://unmute.sh/

babycommando 6 months ago

Someone please port this to ONNX so we don't need to do all this ass tooling

RobKohr 6 months ago

What's a good one in reverse; speech to text?

jasonjmcghee 6 months ago

Whisper and the many variants. Here's a good implementation.
https://github.com/ggml-org/whisper.cpp
wenc 6 months ago

This one is a whisper-based Python package
https://github.com/primaprashant/hns

victorbjorklund 6 months ago

It is not the best TTS but it is freaking amazing it can be done by such a small model and it is good enough for so many use cases.

rohan_joshi 6 months ago

thanks, but keep in mind that this model is just a preview checkpoint that is only 10% trained. the full release next week will be of much higher quality and it will include a 15M model and an 80M model.

77pt77 6 months ago

How does this compare to say piper-tts?

I ask because their models are pretty small. Some sound awesome and there is no dependency hell like I'm seeing here.

Example: https://rhasspy.github.io/piper-samples/#en_US-ryan-high

tapper 6 months ago

I am blind and use NVDA with a synth. How is this news? I don't get it! My synth is called Eloquence and is 4089KB

mwcampbell 6 months ago

Does your Eloquence installation include multiple languages? The one I have is only 1876 KB for US English only. And classic DECtalk is even smaller; I have here a version that's only 638 KB (again, US English only).

the_arun 6 months ago

I like the direction we are heading. Build models that can run on CPUs & AI can become even more mainstream.

junon 6 months ago

This feels different. This feels like a genuinely monumental release. Holy cow.

Very well done. The quality is excellent and the technical parameters are, simply, unbelievable. Makes me want to try to embed this on a board just to see if it's possible.

akx 6 months ago

This is a fun model for circuit-bending, because the voice style vectors are pretty small.

For instance, try adding `np.random.shuffle(ref_s[0])` after the line `ref_s = self.voices[voice]`...

EDIT: be careful with your system volume settings if you do this.

alexnewman 6 months ago

I'm so confused on how the model is actually made. It doesn't seem to be in the code or this stuff is way simpler than i thought. It seems to use a fancy library from japan, not sure how much it's just that

butz 6 months ago

How does one build a similar model, but for different languages? I was under the impression that, being open source, there would be some instructions on how to build everything on your own.

dirkc 6 months ago

Have you considered adding some 'rendered' examples of what the model sounds like?

I'm curious, but right now I don't want to install the package and run some code.

wewewedxfgdf 6 months ago

Chrome does TTS too.

https://codepen.io/logicalmadboy/pen/RwpqMRV

tecleandor 6 months ago

Not bad for the size (with my very limited knowledge of this field) !

In a couple of tests, the "Male 2" voice sounds reasonable, but I've found it has problems with some groups of words, especially when played with little context. I think it's short sentences.

For example, if you try to do just "Hey gang!", it will sound something like "Chay yang". But if you add an additional sentence after that, it will sound a bit different (but still weird).

mayli 6 months ago

Is this english only?

a2128 6 months ago
If you're looking for other languages, Piper has been around in this scene for much longer and they have open-source training code and a lot of models (they're ~60MB instead of 25MB but whatever...) https://huggingface.co/rhasspy/piper-voices/tree/main
- kenarsa 6 months ago
  
  [flagged]
  
riedel 6 months ago

Actually I found it irritating that the readme does not mention the language at all. I think it is not good practice to deduce it from the language of the readme itself. I would not like to have German language tts models with only a German readme...
evgpbfhnr 6 months ago

I tried on some Japanese for the kicks of it, it reads... "Chinese letter chinese letter japanese letter chinese letter..." :D
But yeah, if it's like any of the others we'll likely see a different "model" per language down the line based on the same techniques
numpad0 6 months ago

TTS is generally not multilingual. One might think well-annotated phonetic descriptions of voices would suffice, but that's not quite how languages work nor how TTS works.
(but somehow LLMs handle multilingual input perfectly fine! that's a bit strange, if you think about that)
g7r 6 months ago

Yes. The FAQ says that multilingual capabilities are in the works.

binary132 6 months ago

I’m new to TTS models but is this something I can plug into my own engine like with LLMs, or does it require the Python stack it ships with?

anthk 6 months ago

Atom n270 running flite with a good voice -slt- vs this... would it be fast enough to play a MUD? Flite is almost realtime fast...

gunalx 6 months ago

Would love to see something like this trained for multilingual purposes. It seems kinda like the same tier as piper, but a bit faster.

C-Loftus 6 months ago

Awesome work! Often times in the TTS space, human-similarity is given way too much emphasis at the expense of hurting user access. Frankly as long as a voice is clear and you listen to it for a while, the brain filters out most quirks you would perceive on the first pass. Hence why many blind folks still are perfectly fine using espeak-ng. The other properties like speed of generation and size make it worth it.

I've been using a custom AI audiobook generation program [0] with piper for quite a while now and am very excited to look at integrating kitten. Historically piper has been the only good option for a free CPU-only local model so I am super happy to see more competition in the space. Easy installation is a big deal, since piper historically has had issues with that. (Hence why I had to add auto installation support in [0])

[0] https://github.com/C-Loftus/QuickPiperAudiobook

zelphirkalt 6 months ago

What I am still looking for is a way to clone voice locally. I have OK hardware. For example I can use Mistral Small 3.1 or what it is called locally. Premade voices can be interesting too, but I am looking for custom voice. Perhaps by providing audio and the corresponding transcript to the model, training it, and then give it a new text and let it speak that.

csukuangfj 6 months ago

I have tested its speed on CPU and compared it with Piper, kokoro, and matcha. See https://github.com/KittenML/KittenTTS/issues/40

imprezagx2 6 months ago

BEAT THIS! Commodore C64 has the same feature called SAM - speech synthesizer, speaks English and Polish. 48 kB of RAM

BEAT THIS!

a96 6 months ago

It's not the same feature, but at least that's not several orders of magnitude away from "run anywhere"

bashkiddie 6 months ago

TL;DR: If you are interested in TTS, you should explore alternatives

I tried to use it...

Its python venv has grown to 6 GBytes in size. The demo sentence

> "This high quality TTS model works without a GPU"

works, it takes 3s to render the audio. Audio sounds like a voice in a tin can.

I tried to have a news article read aloud and failed with

> [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running Expand node. Name:'/bert/Expand'
> Status Message: invalid expand shape

If you are interested in TTS, you should explore alternatives
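One thing that may be worth trying for long inputs is feeding the model one short chunk at a time and concatenating the audio. Whether input length is what triggers the Expand error above is an assumption, not something confirmed here. The sketch below only does the text splitting; `split_into_chunks` and the 200-character limit are arbitrary choices for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


article = "First sentence of the article. Second one follows! Is there a third? Yes."
for chunk in split_into_chunks(article, max_chars=40):
    print(repr(chunk))
```

Each chunk would then be passed to the TTS call separately, with the resulting audio arrays joined in order.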

wewewedxfgdf 6 months ago

say is only 193K on MacOS

  ls -lah /usr/bin/say
  -rwxr-xr-x  1 root  wheel   193K 15 Nov  2024 /usr/bin/say

Usage:

  M1-Mac-mini ~ % say "hello world this is the kitten TTS model speaking"

dented42 6 months ago
That’s not a fair comparison. Say just calls the speech synthesis APIs that have been around since at least Mac OS 8.
That being said, the ‘classical’ (pre-AI) speech synthesisers are much smaller than kitten, so you’re not wrong per se, just for the wrong reason.
- deathanatos 6 months ago
  
  The linked repository at the top-level here has several gigabytes of dependencies, too.
selcuka 6 months ago

SAM on Commodore 64 was only 6K:
https://project64.c64.org/Software/SAM10.TXT
Obviously it's not fair to compare these with ML models.
wnoise 6 months ago

And what dynamic libraries is it linked to? And what other data are they pulling in?
satvikpendem 6 months ago
`say` sounds terrible compared to modern neural network based text to speech engines.
- wewewedxfgdf 6 months ago
  
  Sounds about the same as Kitten TTS.
  
tonypapousek 6 months ago
Tried that on 26 beta, and the default voice sounds a lot smoother than it used to.
Running `man say` reveals that "this tool uses the Speech Synthesis manager", so I'm guessing the Apple Intelligence stuff is kicking in.
- dented42 6 months ago
  
  Nothing to do with Apple Intelligence. The speech synthesiser manager (the term manager was used for OS components in Classic Mac OS) has been around since the mid 90s or so. The change you’re hearing is probably a new/modified default voice.
  
a96 6 months ago

If you make a shell script that calls say, that script will be even smaller!

skurtcastle 6 months ago

Not bad. Something I would not want to listen to for long without more clarity. Could work very well for non-English speakers in various tools and such.

indigodaddy 6 months ago

Can coqui run in cpu only?

palmfacehn 6 months ago

Yes, XTTS2 has been reasonably performant for me and the cloning is acceptable.

BenGosub 6 months ago

I wonder what it would take to extend it with a custom voice?

system2 6 months ago

One thing any GitHub project never has. A few-second demo.

MrGilbert 6 months ago

A localized version of this, and I could finally build my tiny Amazon Echo replacement. I would love to see all speech synthesis performed on a local device.

varenc 6 months ago

I'm doing this now with Home Assistant voice. All the TTS, STT, and LLMs involved run locally on my network. It's absurdly superior to every other voice assistant product. (Would be nice if it was just a pure multi-modal model though)

mg 6 months ago

Good TTS feels like it is something that should be natively built into every consumer device. So the user can decide if they want to read or listen to the text at hand.

I'm surprised that phone manufacturers do not include good TTS models in their browser APIs for example. So that websites can build good audio interfaces.

I for one would love to build a text editor that the user can use completely via audio. Text input might already be feasible via the "speak to type" feature, both Android and iOS offer.

But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there.

The interface I would like would offer a way to talk to write and then commands like "Ok editor, read the last paragraph" or "Ok editor, delete the last sentence".

It could be cool to do writing this way while walking. Just with a headset connected to a phone that sits in one's pocket.

jiehong 6 months ago
On Mac OS you can "speak" a text in almost every app, using built in voice (like the Siri voice or some older voices). All offline, and even from the terminal with "say".
- Fluorescence 6 months ago
  
  I tried it a few months ago to narrate an epub in Apple Books and it was very broken in a weird way. It starts out decent but after a few pages, it starts slurring, skipping words, trailing off not finishing sentences and then goes silent.
  (I've just tried it again without seeing that issue within a few pages)
  > Siri voice or some older voices
  You can choose "Enhanced" and "Premium" versions of voices which are larger and sound nice and modern to me. The "Serena Premium" voice I was using is over 200MB and far better than this Show HN. It's very natural but kind of ruined by diabolical pronunciation of anything slightly non-standard which sadly seems to cover everything I read e.g. people/place names, technical/scientific terms or any neologisms in scifi/fantasy.
  It's so wildly incomprehensible for e.g. Tibetan names in a mountaineering book, that you have to check the text. If the word being butchered is frequently repeated e.g. main character’s name, then it's just too painful to use.
pjc50 6 months ago

Can't most people read faster than they can hear? Isn't this why phone menus are so awful?
> But there seems to be no good way to output spoken text without doing round-trips to a server and generate the audio there
As people have been pointing out, we've had mediocre TTS since the 80s. If it was a real benefit people would be using even the inadequate version.

righthand 6 months ago

The sample rate does more than change the quality.

yunusabd 6 months ago

Impressive, might use this for https://hnup.date

theshrike79 6 months ago
Love the idea, but the text it produces is way too flowery for my taste
"A new tool is stirring up excitement and debate in the programming community"
Just give me the facts without American style embellishments. You're not trying to sell me anything =)
- yunusabd 6 months ago
  
  Sorry, just saw this.
  I absolutely agree, but it's really stubborn with the flowery language. I tried adding things like "DO NOT USE EMPTY PHRASES LIKE 'EVER-EVOLVING TECH LANDSCAPE'!!!!!" to the prompt, but it just can't resist.
  I want to give the whole system an overhaul, maybe newer models are better at this. Or maybe a second LLM pass to de-flowerize (lol) the language.

thedangler 6 months ago

Elixir folks. How would I use this with Elixir? I'm new to Elixir and could use this in about 15 days.

bglusman 6 months ago
It looks like it's Python, so it might be possible to use via https://github.com/livebook-dev/pythonx ? But the parallel huggingface/bumblebee idea was also good, one I hadn't seen or thought of; that definitely works for a lot of other models, curious if you get it working! Some chance I'll play with this myself in a few months, so feel free to report back here or DM me!
- bglusman 6 months ago
  
  I just decided to try this quickly and hit some issues on my Mac FYI, it might work better on Linux but I hit a compilation issue with `curated-tokenizers`, possibly from a typo in setup.py or pyproject.toml in curated-tokenizers, spotted by AI: -Wno-sign-compare-Wno-strict-prototypes should be -Wno-sign-compare -Wno-strict-prototypes so could perhaps fix with a PR to curated-tokenizers or by forking it...
  Might well be other issues behind that, and unclear if need any other dependencies that kitten doesn't rely on directly like torch or torchaudio? but... not 5 mins easy, but looks like issues might be able to be worked through...
  For reference this is all I was trying basically:
    Mix.install([:pythonx])
    Pythonx.uv_init("""
    [project]
    name = "project"
    version = "0.0.0"
    requires-python = ">=3.8"
    dependencies = [
      "kittentts @ https://github.com/KittenML/KittenTTS/releases/download/0.1/kittentts-0.1.0-py3-none-any.whl"
    ]
    """)
  to get the above error.
dorian-graph 6 months ago

It's not possible so far via Bumblebee, unfortunately[1].
[1] https://github.com/elixir-nx/bumblebee/issues/209

mrfakename 6 months ago

Cool, it looks like this model is pretty similar to StyleTTS 2? Would it be possible to confirm?

pjcodes 6 months ago

This looks pretty awesome. I will definitely give it a try and let you know the results

moomoo11 6 months ago

Are there any speech to text (opposite direction) models that I can load in a mobile app?

Perz1val 6 months ago

Is the name a joke on "If the emperor had a tts device"? It's funny

felarof 6 months ago

We can integrate this into the browser directly!

-- browserOS.com

yahoozoo 6 months ago

Is there a paper describing the architecture of the model?

andai 6 months ago

Can you run it in reverse for speech recognition?

divamgupta 6 months ago

We will release an STT model as well.
gromgull 6 months ago

no, but whisper has a 39M model: https://github.com/openai/whisper

marcobambini 6 months ago

Is there any way to get a .gguf version?

countfeng 6 months ago

Very good model, thanks for the open source

rohan_joshi 6 months ago

thanks a lot, this model is just a preview checkpoint. the full release next week will be of much higher quality.

OrangeMusic 6 months ago

It's just so annoying and idiotic that there aren't a few samples on the home page. It didn't occur to you that it's the very first thing people would want to hear?

mattfrommars 6 months ago

Can this work on an Intel NPU?

oscar_zhou 6 months ago

It looks great

alexwang123 6 months ago

This is really great.

android521 6 months ago

it would be great if there is typescript support in the future