
Comment by dcreater

4 days ago

As a non-MLE, what are the pros/cons of OP's PyTorch re-implementation?

It is extremely valuable for researchers who commonly prototype theories using PyTorch on less powerful devices. Many of my colleagues run theory experiments on GPT-2 models; this allows for an easy transition to testing on a SOTA model instead.

I'm not an ML engineer, so I can speak to the "non MLE" bit from my perspective.

(literal tl;dr: learning and experimentation opportunity)

1. Since it's just PyTorch, one can run it locally on whatever accelerator PyTorch supports. For quite a few people that includes Metal Performance Shaders (see the device-selection sketch after this list): https://docs.pytorch.org/docs/stable/mps.html

I can attest that building PyTorch from git takes about 15 minutes on my M1 Pro, if you really want to chase that rabbit hole. Cloning PyTorch is its own special 'please. wait.', but building it is fine.

2. Since it's (of the ones that I've looked at) approximately 500 lines long, it's much, much more digestible than a lot of the vomit that comes out of so-called production systems. Those systems have usually only heard about typed Python in passing, and they believe it's a fad that will blow over. The ones in this repo aren't stellar about it either, but at 500 lines it's entirely achievable to type-hint the code yourself, which can serve as an excellent learning opportunity (see the annotation sketch after this list).

3. PyTorch also offers some fun conversion tools, allowing one to compare and contrast how the same model executes under Torch versus ONNX <https://docs.pytorch.org/docs/stable/onnx.html>, TorchScript <https://docs.pytorch.org/docs/stable/generated/torch.jit.sav...>, CoreML <https://apple.github.io/coremltools/docs-guides/source/conve...>, or a bazillion other competing frameworks (a tiny ONNX export sketch is below).

4. Relatedly, one can play around with quantization and other "inference-related" concerns (e.g. https://github.com/pytorch/ao#pytorch-native-training-to-ser... ); a quick dynamic-quantization sketch follows the list.

5. Further related, one can play around with the fine-tuning mentioned elsewhere, to better understand what is and isn't possible to achieve with that process (a one-step fine-tuning sketch is below). Because the code is digestible and the models are reasonably sized (Qwen 0.6B weighs only 1.4GB and is Apache 2), it brings FAFO opportunities in ways that gpt-oss-20b (or bigger!) won't.
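
To make point 1 concrete, here's a minimal sketch of picking whichever accelerator your local build exposes (MPS on Apple silicon, CUDA elsewhere, CPU as the fallback); the little matmul is just a stand-in workload:

    import torch

    # Prefer MPS (Apple silicon), then CUDA, then fall back to CPU.
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    x = torch.randn(4, 4, device=device)
    print(device, (x @ x).sum().item())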
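
For point 2, type-hinting the model code yourself mostly comes down to annotating tensors and module attributes; this toy module is made up, but it shows the style of annotations I mean:

    import torch
    from torch import Tensor, nn

    class TinyMLP(nn.Module):
        """Made-up two-layer block, just to illustrate the annotation style."""

        def __init__(self, d_in: int, d_hidden: int) -> None:
            super().__init__()
            self.up: nn.Linear = nn.Linear(d_in, d_hidden)
            self.down: nn.Linear = nn.Linear(d_hidden, d_in)

        def forward(self, x: Tensor) -> Tensor:
            # (batch, d_in) -> (batch, d_in)
            return self.down(torch.relu(self.up(x)))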
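
For point 3, the ONNX leg of that compare-and-contrast can be as small as exporting a module and printing the graph; this assumes the onnx package is installed and uses a stand-in model rather than the repo's:

    import onnx
    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
    example = torch.randn(1, 8)

    # Export to ONNX, then reload the file to poke at the resulting graph.
    torch.onnx.export(model, (example,), "tiny.onnx",
                      input_names=["x"], output_names=["y"])
    print(onnx.helper.printable_graph(onnx.load("tiny.onnx").graph))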
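
For point 4, even before reaching for torchao, PyTorch's built-in dynamic quantization is enough to get a feel for the size/speed trade-offs; it swaps nn.Linear weights to int8 and runs on CPU:

    import torch
    from torch import nn
    from torch.ao.quantization import quantize_dynamic

    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

    # Replace the Linear layers with int8 dynamically-quantized versions (CPU only).
    qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(1, 256)
    print("fp32:", model(x)[0, :3])
    print("int8:", qmodel(x)[0, :3])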
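
And for point 5, a single fine-tuning step on the small Qwen checkpoint fits in a few lines with Hugging Face transformers; the hub id below is my assumption for the 0.6B model, so swap in whichever checkpoint you actually use:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "Qwen/Qwen3-0.6B"  # assumed hub id for the 0.6B checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    model.train()

    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # One toy training step: passing labels makes the model return its LM loss.
    batch = tok(["a tiny fine-tuning smoke test"], return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    print("loss:", loss.item())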

I do appreciate that some of what I said may skate close to "ML engineer" concerns, so obviously your situation will be different. But for me, having a better grip on how these things work enables me to have better conversations with my colleagues, and it also helps trip my bullshit detector when someone claims they're the second coming and are going to cure cancer or whatever.