Comment by Herring

21 days ago

I'd say try the nanogpt speedrun. It's much easier to train, and gives you a better comparison vs optimized systems.

https://github.com/KellerJordan/modded-nanogpt

5 comments


naasking 21 days ago

The linked paper tested nanoGPT with this new transformer:

https://www.techrxiv.org/users/685780/articles/1375955-topol...

tuned 20 days ago
Thanks for linking.
Yes, the paper compares the new architecture (which is also a fork of my implementation of nanoGPT) with Karpathy's nanoGPT. It also links to the code and benchmarks used.
- Herring 20 days ago
  
  Note I didn't say Karpathy's nanoGPT, I said use the speedrun.
  Transformers are universal function approximators. When well-tuned, they often start to approximate other innovations. Not always, thank god, but often enough that you have to be careful.
  

nickpsecurity 21 days ago

Labs were also competing to train BERTs for $20 or less. People still use them a lot, too.

https://www.databricks.com/blog/mosaicbert

I'll add that they should do a number of small training runs with different architectures and data mixes. That would show the result generalizes.