
Comment by mtrovo

2 months ago

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how many low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?

I think the fact alone that distillation and quantization are techniques that can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

Yet this is what happens - the distilled or quantized models often come very close to the original model.

So I think there are still many low-hanging fruits to pick.

  • We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.

    Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."
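
    As a toy illustration of the pruning side of this, here is a minimal magnitude-pruning sketch (assuming NumPy). It only shows the "keep the big weights" step; the full Lottery Ticket procedure also rewinds the surviving weights to their initial values and retrains, so this is my own simplification, not the paper's code.

    ```python
    import numpy as np

    def magnitude_prune(weights: np.ndarray, keep_fraction: float = 0.2) -> np.ndarray:
        """Zero out all but the top `keep_fraction` of weights by absolute value."""
        k = max(1, int(weights.size * keep_fraction))
        threshold = np.sort(np.abs(weights).ravel())[-k]   # k-th largest magnitude
        mask = np.abs(weights) >= threshold                # the surviving "winning ticket"
        return weights * mask

    w = np.random.randn(512, 512)                          # stand-in for a trained layer
    sparse_w = magnitude_prune(w, keep_fraction=0.2)       # roughly 80% of weights zeroed
    ```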

    • So, can a distilled 8B model (say, DeepSeek-R1-Distill-Llama-8B or whatever) be "trained up" to a larger 16B-parameter model after distillation from a superior model, or is it forever stuck at the 8B parameters, which can only be fine-tuned?

  • I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

    The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.

    • > in that a distilled model of an LLM is like a JPEG of a photo

      That's an interesting analogy, because I've always thought of the hidden states (and weights and biases) of an LLM as a compressed version of the training data.

      14 replies →

    • This brings up an interesting thought too. A photo is just a lossy representation of the real world.

      So it's lossy all the way down with LLMs, too.

      Reality > Data created by a human > LLM > Distilled LLM

    • What you say makes sense, but is there the possibility that because it’s compressed it can generalize more? In the spirit of bias/variance.

    • Yeah, but it does seem that they're getting high percentage numbers for the distilled models' accuracy against the larger model. If the smaller model is 90% as accurate as the larger one but uses far less than 90% of the parameters, then surely that counts as a win.

  • Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use the teacher's logits (typically just the top few).

    "My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

    Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them as well. That way, the model learns faster, because it gets more information in each update.

    (So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
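
    A minimal sketch of that extra signal, assuming PyTorch: the loss mixes the usual hard-label cross-entropy with a KL term against the teacher's softened logits. The names, the temperature, and the 0.5 mixing weight are illustrative defaults, not taken from any particular distillation recipe.

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        # Hard-label term: only the single expected token from the dataset carries signal.
        hard = F.cross_entropy(student_logits, targets)
        # Soft-label term: the teacher's whole distribution (e.g. other plausible names)
        # adds extra information to every update.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        return alpha * hard + (1 - alpha) * soft
    ```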

  • For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
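
    As a concrete picture of "making more efficient use of bits", here is a toy symmetric int8 scheme (assuming NumPy; real quantizers use per-channel scales, zero points, group-wise schemes, and so on):

    ```python
    import numpy as np

    def quantize_int8(w: np.ndarray):
        scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale                                      # 8 bits per weight plus one float

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale                  # approximate the original weights

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, s = quantize_int8(w)
    print(np.abs(w - dequantize(q, s)).max())                # small error, 4x less memory than fp32
    ```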

  • > still have no real comprehensive understanding of how the models work.

    We do understand how they work, we just have not optimised their usage.

    For example, take someone who has a good general understanding of how an ICE or EV car works. Even if the user interface is very unfamiliar, they can figure out how to drive any car within a couple of minutes.

    But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

    • Your example is somewhat inadequate. We _fundamentally_ don’t understand how deep learning systems work, in the sense that they are more or less black boxes that we train and evaluate. Innovations in ML are a whole bunch of wizards with big stacks of money changing “Hmm” to “Wait” and seeing what happens.

      Would a different sampler help you? I dunno, try it. Would a smaller dataset help? I dunno, try it. Would training the model for 5000 days help? I dunno, try it.

      Car technology is the opposite of that - it’s a white box. It’s composed of very well defined elements whose interactions are defined and explained by laws of thermodynamics and whatnot.

      13 replies →

    • We know how the next token is selected, but not why doing that repeatedly brings all the capabilities it does. We really don't understand how the emergent behaviours emerge.

      4 replies →

    • The "Wait" vs. "Hmm" discussion in the paper does not suggest we know how they work. If we knew, we wouldn't have to try things and measure to figure out the best prompt.

It feels like we're back in 1900, when anyone's clever idea (and implementation) could give huge performance improvements, such as Ford's assembly line or Taylor's scientific management of optimizing shovel sizes for coal.

Agreed. Here are three things that I find surreal about the s1 paper.

(1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

(2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

(3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?

  • Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.

Now imagine where we'll be 12 months from now. This article from February 5, 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursively self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems likely to arrive within the next few years. Terrifying.

  • This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

    I went from accepting I wouldn't see true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

    Just 6 months ago people were wondering if pre-training was stalling out and if we'd hit a wall. Then DeepSeek drops with RL'd inference-time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck, and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

    It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs mean that there may be several open-source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.

    • This frightens mostly people whose identity is built around "intelligence" but without grounding in the real world. I've yet to see really good articulations of what, precisely, we should be scared of.

      Bedroom superweapons? Algorithmic propaganda? These things have humans in the loop building them. And the problem of "human alignment" is one unsolved since Cain and Abel.

      AI alone is words on a screen.

      The sibling thread details the "mass unemployment" scenario, which would be destabilizing, but understates how much of the current world of work is still physical. It's a threat to pure desk workers, but we're not the majority of the economy.

      Perhaps there will be political instability, but .. we're already there from good old humans.

      8 replies →

    • > The intelligence that will be available to the average technically literate individual will be frightening.

      That's not the scary part. The scary part is the intelligence at scale that could be available to the average employer. Lots of us like to LARP that we're capitalists, but very few of us are. There's zero ideological or cultural framework in place to prioritize the well-being of the general population over the profits of some capitalists.

      AI, especially accelerating AI, is bad news for anyone who needs to work for a living. It's not going to lead to a Star Trek fantasy. It means an eventual phase change for the economy that consigns us (and most consumer product companies) to wither and fade away.

      19 replies →

  • Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

      [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.
    

    > reasoning models which do AI research

    In the introduction to my research project on Accelerationism [0], I write:

      Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]
    

    At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

    [0] https://retrochronic.com/

    [1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.

    • Hope we get the Nick Land the younger, and not the Nick Land the elder, set of outcomes. Somewhere, sometime along the way, it seems like everything from the CCRU and Duginism leapt off the page into the real. Maybe it's just the beginning of the Baudrillardian millennium.

    • Nice. Though I couldn't understand those "opening lines" until I read in your Introduction:

      > For Land, capitalism begins in Northern Italy around 1500 with "the emerging world of technologists and accountants", the spiral interexcitation of "oceanic navigation and place-value calculation", and zero-unlocked double-entry book-keeping

      Fibonacci, amongst many others, played a critical role in that highly accelerative technology.

I think a skill here is learning a bias for experimentation and accepting the results one finds. Also, the book "Why Greatness Cannot Be Planned" showcases the kind of open-ended play that results in people discovering stuff like this.

One thing to realize is that we as humans have thinking steps (an internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

The above 'hack' is one of many realizations of the above differences.

In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

I'm not sure what causes this, but in my opinion it's either that training is affected by the dates in the training data (i.e. the model refuses to answer properly because in every year of the training data there were fewer or lower-quality examples at the end of the year), or that it's a cultural impression from humans in the training data talking about going on holiday or having a break at certain times, with the model associating this with the meaning of "having a break".

I still wonder if we're building models wrong by training them on a huge amount of data from the Internet and then fine-tuning for instruct, where the model learns to make certain logical associations inherent in (or similar to) the training data, which seems to introduce a myriad of issues like the strawberry problem or getting "is x less than y" wrong.

I feel like these models would have a lot more success if we trained a model to learn logic/problem solving separately from the core data set, or restricted the instruct fine-tuning in some way, so that we reduce the amount of "culture" it gleans from the data.

There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.

Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?

  • Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.

  • Yes, that's explicitly mentioned in the blog post:

    >In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".
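
    A rough sketch of that replacement loop in code (my own illustration under stated assumptions, not the s1 authors' implementation; `next_token` is a hypothetical stand-in for whatever decoding function you use):

    ```python
    from typing import Callable

    END_THINK = "</think>"

    def force_more_thinking(prompt: str,
                            next_token: Callable[[str], str],
                            min_thinking_tokens: int = 512) -> str:
        out: list[str] = []
        while True:
            tok = next_token(prompt + "".join(out))
            if tok == END_THINK and len(out) < min_thinking_tokens:
                out.append("Wait")      # suppress the end-of-thinking marker, keep reasoning
                continue
            out.append(tok)
            if tok == END_THINK:        # budget reached: let it close the thinking block
                return "".join(out)
    ```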

I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.

  • Semantically, wait is a bit of a stop-and-breathe point.

    Consider the text:

    I think I'll go swimming today. Wait, ___

    what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

    Wait, I need to help my dad.

    • Yes, R1 seems to mostly use it like that. It's either to signal a problem with its previous reasoning, or if it's thought of a better approach. In coding it's often something like "this API won't work here" or "there's a simpler way to do this".

      1 reply →

  • I bet a token like "sht!", "f*" or "damn!" would have the same or an even stronger effect, but the LLM creators would not want their users to read them.

    • Maybe, but it doesn't just use it to signify that it's made a mistake. It also uses it in a positive way, such as it's had a lightbulb moment. Of course some people use expletives in the same way, but that would be less common than for mistakes.

    • I think you're onto something; however, as the training is done on text and not actual thoughts, it may take some experimentation to find these stronger words.

> a branch of computer science

It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation are far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.

  • Almost seems more like computer engineering. Is it really that different than signal/image processing?

    I suspect CS departments don’t want to concede because they are now in the limelight…

  • I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?" If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>", it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.

Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, whenever I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.

What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process, the same way background noise can derail ours?
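
A sketch of what such a padding instruction might look like, using an OpenAI-style chat message list; the exact wording is illustrative, not the prototype's actual prompt:

```python
# Illustrative only: a system prompt asking the model to "think" in silent
# whitespace tokens before answering, since whitespace is never read aloud.
messages = [
    {
        "role": "system",
        "content": (
            "You are a hands-free voice assistant. For philosophical questions, "
            "first output several hundred whitespace characters, then give your answer."
        ),
    },
    {"role": "user", "content": "Is free will compatible with determinism?"},
]
```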

I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.

May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest to not seek+publish such findings.

If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.

  • I think NVIDIAs future is pretty bright.

    We're getting to the run-your-capable-LLM on-prem or at-home territory.

    Without DeepSeek (and hopefully its successors) I wouldn't really have a usecase for something like NVIDIAs Project Digits.

    https://www.nvidia.com/en-us/project-digits/

    • Except I can run R1 1.5b on a GPU-less and NPU-less Intel NUC from four-five years ago using half its cores and the reply speed is…functional.

      As the models have gotten more efficient and distillation better, the minimum viable hardware for really cooking with LLMs has suddenly gone from a 4090 to something a lot of people already probably own.

      I definitely think a Digits box would be nice, but honestly I’m not sure I’ll need one.

      3 replies →

I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?