Comment by hnben

1 month ago

> if you assume all the computational pathways happen in parallel on a GPU, that doesn't necessarily increase the time the model spends thinking about the question

The layout of the NN is actually quite complex: a large amount of information is computed besides the tokens themselves and the weights (think "latent vectors" that each token position carries through the layers).
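A minimal numpy sketch of that point (dimensions and names are illustrative, not any real model): each token position carries a latent vector that the layers refine in place, and every position is updated at once, so extra depth is not extra wall-clock "thinking" per token.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 4, 8, 3

# One latent vector per token position -- state beyond the tokens and weights.
latents = rng.normal(size=(n_tokens, d_model))
weights = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_layers)]

for W in weights:
    # All token positions are transformed in a single matrix multiply,
    # i.e. in parallel on a GPU; only the depth of the stack grows.
    latents = latents + np.tanh(latents @ W)  # residual-style update

print(latents.shape)  # (4, 8): same latents, refined layer by layer
```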

I recommend the 3b1b YouTube series on the topic.