Comment by danielhanchen

1 year ago

Thanks! :) I'm pushing them into transformers, pytorch-gemma and collabing with the Gemma team to resolve all the issues :)

The RoPE fix should already be in transformers 4.38.2: https://github.com/huggingface/transformers/pull/29285

My main PR for transformers which fixes most of the issues (some still left): https://github.com/huggingface/transformers/pull/29402

Incredible indeed! Just hunting down one of these bugs feels like a very time-consuming endeavor.

What's your approach for these more subtle numerical bugs?

  • I'm gonna guess he tried to reimplement some of the work from the ground up and wondered why certain results looked like they did.

    • Yep! The goal was to implement Gemma in Unsloth to make finetuning faster and use less VRAM, and my reimplementation seems to get different results from the current ones.

  • Ye it was indeed very gruelling - but very fun!! I used torch.dist everywhere, read all the implementations side by side to compare them, and had to manually inspect losses, plot them, etc. (a rough sketch of that kind of comparison is below). It's a bit hard to automate sadly, since new archs cause new issues.
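
For reference, a minimal sketch of that kind of side-by-side check, assuming two hypothetical models (`ref_model` and `my_model`) that both return per-layer hidden states; the names, shapes, and tolerance are illustrative, not Unsloth's actual code:

```python
import torch

def compare_hidden_states(ref_states, new_states, atol=1e-3):
    """Print the distance between matching hidden states from two implementations."""
    for i, (ref, new) in enumerate(zip(ref_states, new_states)):
        # torch.dist(a, b) returns the p-norm (default p=2) of (a - b)
        l2 = torch.dist(ref.float(), new.float()).item()
        max_abs = (ref.float() - new.float()).abs().max().item()
        flag = "" if max_abs < atol else "  <-- diverges here"
        print(f"layer {i:02d}: L2={l2:.6f}  max|diff|={max_abs:.6f}{flag}")

# Hypothetical usage: run both models on the same inputs with hidden states enabled,
# then compare layer by layer to find where the two implementations drift apart.
# out_ref = ref_model(input_ids, output_hidden_states=True)
# out_new = my_model(input_ids, output_hidden_states=True)
# compare_hidden_states(out_ref.hidden_states, out_new.hidden_states)
```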