Comment by hnben
1 month ago
> if you assume all the computational pathways happen in parallel on a GPU, that doesn't necessarily increase the time the model spends thinking about the question
The layout of the NN is actually quite complex: a large amount of information is computed alongside the tokens themselves and the weights (think "latent vectors").
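To make this concrete, here is a minimal numpy sketch (all sizes and weight matrices are made up for illustration) of how a self-attention step carries one latent vector per token and mixes information across all positions in a single matmul, which is why GPU parallelism doesn't add sequential "thinking" steps:

```python
import numpy as np

# Hypothetical sizes: 4 tokens in the context, latent dimension 8.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)

# One latent vector per token position -- the "information beside the
# tokens themselves": activations, not weights.
hidden = rng.normal(size=(seq_len, d_model))

# Illustrative (random) projection weights for queries, keys, values.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# A single attention step: the whole (seq_len x seq_len) score matrix is
# one matmul, so every position attends to every other in parallel.
q, k, v = hidden @ W_q, hidden @ W_k, hidden @ W_v
scores = q @ k.T / np.sqrt(d_model)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ v  # still one latent vector per token

print(out.shape)  # (4, 8)
```

The point is that `out` has the same shape as `hidden`: depth in the network adds latent computation per token, not extra tokens.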
I recommend the 3b1b youtube-series on the topic.