Comment by shenberg

8 days ago

Moshi was an amazing tech demo. Building the entire stack from scratch in 6 months with a small team was an amazing show of skill: a 7B text LLM (data + training), emotive TTS for synth data generation (again model + data collection), a synth data pipeline, a novel speech codec, a Rust inference stack for low latency, and an audio LLM architecture incl. a text "thoughts" stream, which was novel.

But this is a fluff piece: "underfunded" means a total of around $400 million ($330 million in the initial round, $70 million for Gradium). Compare that to ElevenLabs, which used a $2 million pre-seed to create its initial product.

A bunch of other stuff there is disingenuous, like comparing their 7B model to Llama-3 405B (hint: the 7B model is a _lot_ dumber). There's also the outright lie that a team of 4 made Moshi, which is corrected _in the same piece_ to 8 if you read far enough.

Stopped reading there: "This model (Moshi) could [...] recite an original poem in a French accent (research shows poems sound better this way)."