Comment by mysterEFrank

2 months ago

I'm surprised more attention isn't paid to this research direction, and that nobody has tried to generalize it, for example by combining the recurrence concept with next-token prediction. That said, despite the considerable gains, this seems to be just hyperparameter tweaking rather than a foundational improvement.

3 comments

in-silico 2 months ago

> nobody has tried to generalize it for example by combining the recurrence concept with next token prediction

Here you go: https://arxiv.org/abs/2502.05171

mysterEFrank 2 months ago

Thanks! This seems to work incredibly well.

whiplash451 2 months ago

Not just hyperparameter tweaking, and not foundational research either, but rather engineering improvements that compound with each other (conswiglu layers, Muon optimizer).