
Comment by Al-Khwarizmi

5 days ago

A great writeup; let me just make two nitpicks (not to diminish the awesome effort of the author, but in case they wish to take suggestions).

1. I think the paper underemphasizes the relevance of BERT. From today's LLM-centric perspective it may seem minor because it sits in a different branch of the tech tree, but it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations, in the same order of magnitude as the Transformers paper (197K) and far more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper due to what came afterwards, but at the time it wasn't that useful: it was oriented more toward generation (without being very good at it) and, IIRC, not really publicly available (it was technically open source but not posted in a repository or with a framework that let you actually run it). It's also worth remarking that for many non-generative NLP tasks (things like NER, parsing, sentence/document classification, etc.) the best option is often still a BERT-like model, even in 2025.

2. The writing kind of implies that modern LLMs were consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more that OpenAI researchers noticed GPT-2 was good at generating random text that looked fluent, and thought "if we make it bigger, it will do that even better". But it turned out that not only did it generate better random text, it also started being able to state real facts (despite the occasional hallucinations), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of the "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.

(2) is not quite right. I created ULMFiT specifically because I thought a language model pretrained on a large general corpus and then fine-tuned was the right way to create generally capable NLP models. It wasn't an accident.

The fact that, some time later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then on a specialised corpus, then classification-task fine-tuning) was what GPT-3.5 Instruct used, and that formed the basis of the first ChatGPT product.
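For the curious, here's a minimal sketch of that 3-step recipe as it looks in fastai's current high-level API (the CSV, column names, and epoch counts are illustrative, not from the paper):

    import pandas as pd
    from fastai.text.all import *

    df = pd.read_csv('reviews.csv')  # hypothetical labelled corpus

    # Step 1 comes for free: AWD_LSTM ships pretrained as a causal LM on a
    # large general corpus (WikiText-103).
    # Step 2: fine-tune that language model on the specialised corpus.
    dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
    learn_lm = language_model_learner(dls_lm, AWD_LSTM)
    learn_lm.fine_tune(4)
    learn_lm.save_encoder('ft_enc')

    # Step 3: reuse the fine-tuned encoder for the classification task.
    dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                       text_vocab=dls_lm.vocab)
    learn_clas = text_classifier_learner(dls_clas, AWD_LSTM)
    learn_clas.load_encoder('ft_enc')
    learn_clas.fine_tune(4)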

So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no one else felt the same way; luckily, Alec Radford did! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him, and he set to work on GPT right away.)

PS: On (1), if I may take a moment to highlight my team's recent work: we updated BERT last year to create ModernBERT, which showed that yes, this approach still has legs. Our models have had >1.5m downloads, and there are >2k fine-tunes and variants of them now on Hugging Face: https://huggingface.co/models?search=modernbert
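It also drops straight into the standard transformers fine-tuning flow; a minimal sketch (the two-label task is just for illustration, and ModernBERT support needs transformers >= 4.48):

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "answerdotai/ModernBERT-base", num_labels=2)  # hypothetical 2-class task

    batch = tok("Encoder-only models score text rather than generate it.",
                return_tensors="pt")
    logits = model(**batch).logits  # shape [1, 2]: one score per class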

  • Point taken (both from you and the sibling comment mentioning Phil Blunsom); I should know better than to carelessly drop such broad generalizations as "no one in the field expected..." :)

    Still, I think only a tiny minority of the field expected it, and I think it was also clear from the messaging at the time that the OpenAI researchers who saw GPT-3 (pre-instruct) start solving arbitrary tasks and displaying emergent abilities were surprised by it. Maybe they did have an ultimate goal of creating a general-purpose system via next-word prediction, but I don't think they expected it so soon, or from merely scaling up GPT-2.

  • When you say "classification-task fine-tuning", are you referring to RLHF?

    RLHF seems to have been the critical piece that "aligned" the otherwise rather wild output of a purely "causally" (next-token prediction) trained LLM with what a human expects in terms of conversational turn-taking (e.g. Q&A) and instruction following, as well as more general preferences/expectations.
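    Concretely, the "preference" part of RLHF usually comes down to a pairwise reward-model loss, roughly like this sketch (Bradley-Terry form, as in the InstructGPT paper; tensor names are illustrative):

        import torch
        import torch.nn.functional as F

        def reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
            # Scores from the reward model for the human-preferred and
            # dispreferred completions of the same prompt.
            return -F.logsigmoid(r_chosen - r_rejected).mean()

    The policy model is then tuned (e.g. with PPO) to maximise that learned reward.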

  • You mention that encoder-only approaches like ModernBERT still have legs; would you mind sharing some applications aside from niche NER? Genuinely curious.

Nit: regarding (2), Phil Blunsom did (the same Blunsom from the article, who led language modeling at DeepMind for about 7-8 years). He would often opine at Oxford (where he taught) that solving next-word prediction is a viable meta-path to AGI. Almost nobody agreed at the time. He also called out early that scaling and better data were the key, and they did turn out to be, although Google wasn't as "risk-on" as OpenAI about gathering the data for GPT-1/2. Had it been, history could easily have been different. People forget the position OpenAI was in at the time: Elon (and his funding) had left, key talent had left. Risk appetite was high for that kind of thing… and it paid off.