Comment by charcircuit 8 days ago
I'm surprised at how similar all of them are, with the main differences being the size of layers.
hrmtst93837 7 days ago
Most of the arch work is just scaling knobs. If you swap in weird layer types or move the objective much, people run into ugly failure modes fast, so the field keeps circling the same Transformer blocks and then markets the change as novel when it's mostly a training and compute tradeoff.