Comment by Herring

21 days ago

I'd say try the nanoGPT speedrun. It's much easier to train, and it gives you a better comparison against optimized systems.

https://github.com/KellerJordan/modded-nanogpt
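
If you want to try it, the setup is roughly this (going from memory of the repo's README, so treat the script names and the token count as assumptions that may have drifted; the default run also expects a multi-GPU node):

    git clone https://github.com/KellerJordan/modded-nanogpt.git
    cd modded-nanogpt
    pip install -r requirements.txt
    # grab a slice of the FineWeb training tokens (assumed helper; the README documents the exact count)
    python data/cached_fineweb10B.py 10
    # run.sh wraps a torchrun launch of the training script across all local GPUs
    ./run.sh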

The linked paper tested nanoGPT against this new transformer:

https://www.techrxiv.org/users/685780/articles/1375955-topol...

  • Thanks for linking.

    Yes, the paper compares the new architecture (which is also a fork of my nanoGPT implementation) with Karpathy's nanoGPT. There are also links to the code and the benchmark used.

    • Note I didn't say Karpathy's nanoGPT; I said to use the speedrun.

      Transformers are universal function approximators. When well-tuned, a plain one often starts to approximate the gains of other innovations, so a new architecture that beats an untuned baseline may just be recovering tuning headroom. Not always, thank god, but often enough that you have to be careful.