Comment by ibeckermayer

11 hours ago

Cool that it's possible but basically unusable performance characteristics. For an 8192 token prompt they report a ~1.5 minute time-to-first-token and then 8.30tk/s from there. For context ChatGPT is typically <<1s ttft and ~50tk/s.

Given that APU only has 4 channels isn't this setup comically starved for bandwidth? By the same token, wouldn't you expect performance to scale approximately linearly as you add additional boxes? And wouldn't you be better off with smaller nodes (ie less RAM and CPU power per box)?

If I'm right about that then if you're willing to go in for somewhere in the vicinity of $30k (24x the Max 385 model) you should be able to achieve ChatGPT performance.