
Comment by mtrovo

2 months ago

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how many low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?

I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.

  • We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.

    Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
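
    As a toy illustration of the pruning side of this, here is a minimal magnitude-pruning sketch (assuming NumPy). It only shows the "keep the big weights" step; the full Lottery Ticket procedure also rewinds the surviving weights to their initial values and retrains, so this is my own simplification, not the paper's code.

    ```python
    import numpy as np

    def magnitude_prune(weights: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
        """Zero out all but the top `keep_fraction` of weights by absolute value."""
        k = max(1, int(weights.size * keep_fraction))
        threshold = np.sort(np.abs(weights).ravel())[-k]   # k-th largest magnitude
        mask = np.abs(weights) >= threshold                # the surviving "winning ticket"
        return weights * mask

    w = np.random.randn(512, 512)                          # stand-in for a trained layer
    sparse_w = magnitude_prune(w, keep_fraction=0.2)       # roughly 80% of weights zeroed
    ```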

    • So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a larger 16B-parameter model after distillation from a superior model, or is it forever stuck at the 8B parameters, which can only be fine-tuned?

  • I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

    The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.

    • > in that a distilled model of an LLM is like a JPEG of a photo

      That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.

      14 replies →

    • This brings up an interesting thought too. A photo is just a lossy representation of the real world.

      So it's lossy all the way down with LLMs, too.

      Reality > Data created by a human > LLM > Distilled LLM

    • What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.

    • Yeah, but it does seem that they're getting high percentage numbers for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses far less than 90% of the parameters, then surely that counts as a win.

  • Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use the teacher's logits (typically just the top few).

    "My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

    Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them as well. That way, the model learns faster, because it gets more information in each update.

    (So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
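
    A minimal sketch of that extra signal, assuming PyTorch: the loss mixes the usual hard-label cross-entropy with a KL term against the teacher's softened logits. The names, the temperature, and the 0.5 mixing weight are illustrative defaults, not taken from any particular distillation recipe.

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        # Hard-label term: only the single expected token from the dataset carries signal.
        hard = F.cross_entropy(student_logits, targets)
        # Soft-label term: the teacher's whole distribution (e.g. other plausible names)
        # adds extra information to every update.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        return alpha * hard + (1 - alpha) * soft
    ```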

  • For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
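
    As a concrete picture of "making more efficient use of bits", here is a toy symmetric int8 scheme (assuming NumPy; real quantizers use per-channel scales, zero points, group-wise schemes, and so on):

    ```python
    import numpy as np

    def quantize_int8(w: np.ndarray):
        scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale                                      # 8 bits per weight plus one float

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale                  # approximate the original weights

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(w - dequantize(q, s)).max())                # small error, 4x less memory than fp32
    ```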

  • > still have no real comprehensive understanding of how the models work.

    We do understand how they work, we just have not optimised their usage.

    For example, take someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.

    But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

    • Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

      Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

      Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.

      13 replies →

    • We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.

      4 replies →

    • The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.

It feels like we're back in 1900, when anyone's clever idea (and implementation) could give huge performance improvements, such as Ford's assembly line or Taylor's scientific management of optimizing shovel sizes for coal.

Agreed. Here are three things that I find surreal about the s1 paper.

(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?

  • Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.

Now imagine where we'll be 12 months from now. This article from February 5, 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursively self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems likely to arrive within the next few years. Terrifying.

  • This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

    I went from accepting I wouldn't see true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

    Just 6 months ago people were wondering if pre-training was stalling out and if we'd hit a wall. Then DeepSeek drops with RL'd inference-time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck, and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

    It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs mean that there may be several open-source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.

    • This frightens mostly people whose identity is built around "intelligence" but without grounding in the real world. I've yet to see really good articulations of what, precisely, we should be scared of.

      Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.

      AI alone is words on a screen.

      The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.

      Perhaps there will be political instability, but .. we're already there from good old humans.

      8 replies →

    • > The intelligence that will be available to the average technically literate individual will be frightening.

      That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well-being of the general population over the profits of some capitalists.

      AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

      19 replies →

  • Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

      [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.
    

    > reasoning models which do AI research

    In the introduction to my research project on Accelerationism [0], I write:

      Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]
    

    At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

    [0] https://retrochronic.com/

    [1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.

    • Hope we get the Nick Land the younger, and not the Nick Land the elder, set of outcomes. Somewhere, sometime along the way, it seems like everything from the CCRU and Duginism leapt off the page into the real. Maybe it's just the beginning of the Baudrillardian millennium.

    • Nice. Though I couldn't understand those "opening lines" until I read in your Introduction:

      > For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping

      Fibonacci, amongst many others, played a critical role in that highly accelerative technology.

I think a skill here is learning a bias for experimentation and accepting the results one finds. Also, the book "Why Greatness Cannot Be Planned" showcases the kind of open-ended play that results in people discovering stuff like this.

One thing to realize is that we as humans have thinking steps (an internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

The above 'hack' is one of many realizations of the above differences.

In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

I'm not sure what causes this, but in my opinion it's either that training is affected by the dates in the training data (i.e. the model refuses to answer properly because in every year of the training data there were fewer or lower-quality examples at the end of the year), or that it's a cultural impression from humans in the training data talking about going on holiday or having a break at certain times, with the model associating this with the meaning of "having a break".

I still wonder if we're building models wrong by training them on a huge amount of data from the Internet and then fine-tuning for instruct, where the model learns to make certain logical associations inherent in (or similar to) the training data, which seems to introduce a myriad of issues like the strawberry problem or getting "is x less than y" wrong.

I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately from the core data set, or restricted the instruct fine-tuning in some way, so that we reduce the amount of "culture" it gleans from the data.

There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.

Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?

  • Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.

  • Yes, that's explicitly mentioned in the blog post:

    >In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
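
    A rough sketch of that replacement loop in code (my own illustration under stated assumptions, not the s1 authors' implementation; `next_token` is a hypothetical stand-in for whatever decoding function you use):

    ```python
    from typing import Callable

    END_THINK = "</think>"

    def force_more_thinking(prompt: str,
                            next_token: Callable[[str], str],
                            min_thinking_tokens: int = 512) -> str:
        out: list[str] = []
        while True:
            tok = next_token(prompt + "".join(out))
            if tok == END_THINK and len(out) < min_thinking_tokens:
                out.append("Wait")      # suppress the end-of-thinking marker, keep reasoning
                continue
            out.append(tok)
            if tok == END_THINK:        # budget reached: let it close the thinking block
                return "".join(out)
    ```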

I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.

  • Semantically, wait is a bit of a stop-and-breathe point.

    Consider the text:

    I think I'll go swimming today. Wait, ___

    what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

    Wait, I need to help my dad.

    • Yes, R1 seems to mostly use it like that. It's either to signal a problem with its previous reasoning, or if it's thought of a better approach. In coding it's often something like "this API won't work here" or "there's a simpler way to do this".

      1 reply →

  • I bet a token like "sht!", "f*" or "damn!" would have the same or an even stronger effect, but the LLM creators would not want their users to read them.

    • Maybe, but it doesn't just use it to signify that it's made a mistake. It also uses it in a positive way, such as it's had a lightbulb moment. Of course some people use expletives in the same way, but that would be less common than for mistakes.

    • I think you're onto something; however, as the training is done on text and not actual thoughts, it may take some experimentation to find these stronger words.

> a branch of computer science

It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation are far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.

  • Almost seems more like computer engineering. Is it really that different than signal/image processing?

    I suspect CS departments don’t want to concede because they are now in the limelight…

  • I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?" If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>", it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.

Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, whenever I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.

What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process, the same way background noise can derail ours?
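
A sketch of what such a padding instruction might look like, using an OpenAI-style chat message list; the exact wording is illustrative, not the prototype's actual prompt:

```python
# Illustrative only: a system prompt asking the model to "think" in silent
# whitespace tokens before answering, since whitespace is never read aloud.
messages = [
    {
        "role": "system",
        "content": (
            "You are a hands-free voice assistant. For philosophical questions, "
            "first output several hundred whitespace characters, then give your answer."
        ),
    },
    {"role": "user", "content": "Is free will compatible with determinism?"},
]
```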

I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.

May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest to not seek+publish such findings.

If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.

  • I think NVIDIAs future is pretty bright.

    We're getting to the run-your-capable-LLM on-prem or at-home territory.

    Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.

    https://www.nvidia.com/en-us/project-digits/

    • Except I can run R1 1.5b on a GPU-less and NPU-less Intel NUC from four-five years ago using half its cores and the reply speed is…functional.

      As the models have gotten more efficient and distillation better, the minimum viable hardware for really cooking with LLMs has suddenly gone from a 4090 to something a lot of people already probably own.

      I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.

      3 replies →

I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?