S1: A $6 R1 competitor?

3 months ago (timkellogg.me)

I found the discussion around inference scaling with the 'Wait' hack so surreal. The fact that such an ingeniously simple method can impact performance makes me wonder how much low-hanging fruit we're still missing. It's so weird to think that improvements in a branch of computer science are boiling down to conjuring the right incantation words. How do you even change your mindset to start thinking this way?

  • I think the fact alone that distillation and quantization can produce substantial improvements is a strong sign that we still have no real comprehensive understanding of how the models work.

    If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the performance of the original model.

    Yet this is what happens - the distilled or quantized models often come very close to the original model.

    So I think there is still a lot of low-hanging fruit to pick.

    • We have a partial understanding of why distillation works—it is explained by The Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635). But if I am understanding correctly, that doesn't mean you can train a smaller network from scratch. You need a lot of randomness in the initial large network, for some neurons to have "winning" states. Then you can distill those winning subsystems to a smaller network.

      Note that a similar process happens in the human brain; it is called synaptic pruning (https://en.wikipedia.org/wiki/Synaptic_pruning). Relevant quote from Wikipedia (https://en.wikipedia.org/wiki/Neuron#Connectivity): "It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5x10^14 synapses (100 to 500 trillion)."

      8 replies →

    • I like the analogy of compression, in that a distilled model of an LLM is like a JPEG of a photo. Pretty good, maybe very good, but still lossy.

      The question I hear you raising seems to be along the lines of, can we use a new compression method to get better resolution (reproducibility of the original) in a much smaller size.

      18 replies →

    • Nope, it's quite obvious why distillation works. If you just predict the next token, then the only information you can use to compute the loss is the one expected token. Whereas if you distill, you can also use the (typically top few) logits from the teacher.

      "My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

      Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.

      (So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
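
      Roughly, a sketch of the two losses (PyTorch-style; the tensor names here are illustrative, not from any particular codebase):

        import torch.nn.functional as F

        # student_logits: [batch, vocab] for the position being predicted
        # hard_target:    [batch] index of the single "correct" next token
        # teacher_logits: [batch, vocab] from the larger teacher model

        def hard_loss(student_logits, hard_target):
            # plain next-token prediction: one "right answer" per position
            return F.cross_entropy(student_logits, hard_target)

        def distill_loss(student_logits, teacher_logits, T=2.0):
            # soft targets: the teacher's whole distribution (or its top-k logits)
            # tells the student about every plausible continuation, not just one
            p_teacher = F.softmax(teacher_logits / T, dim=-1)
            log_p_student = F.log_softmax(student_logits / T, dim=-1)
            return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T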

    • For quantization I don't think that's really true. Quantization is just making more efficient use of bits in memory to represent numbers.
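
      For example, a minimal sketch of symmetric int8 quantization (purely illustrative, not any particular library's scheme):

        import numpy as np

        def quantize_int8(w):
            # map float weights onto 255 integer levels; store int8 plus one float scale
            scale = np.abs(w).max() / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            return q, scale

        def dequantize(q, scale):
            return q.astype(np.float32) * scale

        w = np.random.randn(4096).astype(np.float32)
        q, s = quantize_int8(w)
        print(np.abs(w - dequantize(q, s)).max())  # small error, ~4x less memory than float32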

    • > still have no real comprehensive understanding how the models work.

      We do understand how they work, we just have not optimised their usage.

      For example, someone with a good general understanding of how an ICE or EV car works can figure out how to drive any car within a couple of minutes, even if the user interface is very unfamiliar.

      But that does not mean they can race a car, drift a car or drive a car on challenging terrain even if the car is physically capable of all these things.

      20 replies →

  • It feels like we're back in 1900 when anyone's clever idea (and implementation) can give huge performance improvements, such as Ford's assembly line and Taylor's scientific management of optimizing shovel sizes for coal.

    • yes, it also feels like we are going to lose our just-in-time global shipments of anything to anywhere any day now. It will soon feel like 1900 in other ways.

      2 replies →

  • Agreed. Here are three things that I find surreal about the s1 paper.

    (1) The abstract changed how I thought about this domain (advanced reasoning models). The only other paper that did that for me was the "Memory Resource Management in VMware ESX Server". And that paper got published 23 years ago.

    (2) The model, data, and code are open source at https://github.com/simplescaling/s1. With this, you can start training your own advanced reasoning models. All you need is a thousand well-curated questions with reasoning steps.

    (3) More than half the references in the paper are from 2024 and Jan 2025. Just look at the paper's first page. https://arxiv.org/pdf/2501.19393 In which other field do you see this?

    • Omg, another fan of "Memory Resource Management in VMware ESX Server"!! It's one of my favorite papers ever - so clever.

  • Now imagine where we'll be 12 months from now. This article from February 5 2025 will feel quaint by then. The acceleration keeps increasing. It seems likely we will soon have recursive self-improving AI -- reasoning models which do AI research. This will accelerate the rate of acceleration itself. It sounds stupid to say it, but yes, the singularity is near. Vastly superhuman AI now seems likely to arrive within the next few years. Terrifying.

    • This is something I have been suppressing since I don't want to become chicken little. Anyone who isn't terrified by the last 3 months probably doesn't really understand what is happening.

      I went from accepting I wouldn't see a true AI in my lifetime, to thinking it is possible before I die, to thinking it is possible in the next decade, to thinking it is probable in the next 3 years, to wondering if we might see it this year.

      Just 6 months ago people were wondering if pre-training was stalling out and if we'd hit a wall. Then DeepSeek drops with RL'd inference-time compute, China jumps from being 2 years behind in the AI race to being neck-and-neck, and we're all wondering what will happen when we apply those techniques to the current full-sized behemoth models.

      It seems the models that are going to come out around summer time may be jumps in capability beyond our expectations. And the updated costs mean that there may be several open source alternatives available. The intelligence that will be available to the average technically literate individual will be frightening.

      29 replies →

    • Yes, and Accelerationism predicted this development back in the 1990s, perhaps most prominently in the opening lines of Nick Land's Meltdown (1994) text:

        [[ ]] The story goes like this: Earth is captured by a technocapital singularity as renaissance rationalization and oceanic navigation lock into commoditization take-off. Logistically accelerating techno-economic interactivity crumbles social order in auto-sophisticating machine runaway. As markets learn to manufacture intelligence, politics modernizes, upgrades paranoia, and tries to get a grip.
      

      > reasoning models which do AI research

      In the introduction to my research project on Accelerationism [0], I write:

        Faced with the acceleration of progress in Artificial Intelligence (AI) — with AI agents now automating AI research and development —, Accelerationism no longer seems like an abstract philosophy producing empty hyperstitional hype, but like a sober description of reality. The failed 2023 memorandum to stop AI development on systems more powerful than OpenAI's ChatGPT-4 perfectly illustrates the phenomenological aspects of Accelerationism: "To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon." [1]
      

      At the current rate of acceleration, if you don't write hyperstitionally, your texts are dead on arrival.

      [0] https://retrochronic.com/

      [1] Nick Land (2017). A Quick-and-Dirty Introduction to Accelerationism in Jacobite Magazine.

      2 replies →

  • I think a skill here is learning a bias for experimentation and accepting the results one finds. Also the book "Why Greatness Cannot Be Planned" showcases the kind of open ended play that results in people discovering stuff like this.

  • One thing is to realize that we as humans have thinking steps (an internal monologue) before we output text. When LLMs produce text, we expect this thinking process to happen as well, but it does not - they are 'idiots that babble the first thing that comes to their minds'.

    The 'hack' above is one of many realizations of this difference.

  • In a way it's the same thing as finding that models got lazier closer to Christmas, ie the "Winter Break" hypothesis.

    Not sure what caused that, but in my opinion it's either that training is affected by the dates in the training data (i.e. the model refuses to answer properly because in every year of the training data there were fewer or lower-quality examples at the end of the year), or that it's a cultural impression of humans talking about going on holiday/having a break etc. in the training data at certain times, with the model associating this with the meaning of "having a break".

    I still wonder if we're building models wrong by training them on a huge amount of data from the Internet, then fine-tuning for instruct, where the model learns to make the logical associations inherent in (or similar to) the training data - which seems to introduce a myriad of issues like the strawberry problem or getting "is x less than y" wrong.

    I feel like these models would have a lot more success if we trained the logic/problem-solving ability separately from the core data set, or restricted the instruct fine-tuning in some way, so that we reduce the amount of "culture" the model gleans from the data.

    There's so much that we don't know about this stuff yet and it's so interesting to see something new in this field every day. All because of a wee paper on attention.

  • Wait, so the trick is they reach into the context and basically switch '</think>' with 'wait' and that makes it carry on thinking?

    • Not sure if your pun was intended, but 'wait' probably works so well because of the models being trained on text structured like your comment, where "wait" is followed by a deeper understanding.

    • Yes, that's explicitly mentioned in the blog post:

      >In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".

  • I've noticed that R1 says "Wait," a lot in its reasoning. I wonder if there's something inherently special in that token.

    • Semantically, wait is a bit of a stop-and-breathe point.

      Consider the text:

      I think I'll go swimming today. Wait, ___

      what comes next? Well, not something that would usually follow without the word "wait", probably something entirely orthogonal that impacts the earlier sentence in some fundamental way, like:

      Wait, I need to help my dad.

      2 replies →

  • > a branch of computer science

    It should be considered a distinct field. At some level there is overlap (information theory, Kolmogorov complexity, etc.), but prompt optimization and model distillation are far removed from computability, formal language theory, etc. The analytical methods, the techniques to create new architectures, etc. are very different beasts.

    • Almost seems more like computer engineering. Is it really that different than signal/image processing?

      I suspect CS departments don’t want to concede because they are now in the limelight…

    • I agree - I don't know what field it formally is, but computer science it is not. It is also related to information retrieval aka "Google skills", problem presentation, 'theory of mind', even management and psychology. I'm saying the latter because people often ridicule AI responses for giving bad answers that are 'too AI'. But often it is simply because not enough context-specific information was given to allow the AI to give a more personalized response. One should compare the response to "If I had asked a random person on the internet this query, what might I have gotten?" If you write "The response should be written as a <insert characteristics, context, whatever you feel is relevant>" it will deliver a much less AI-sounding response. This is just as much about how you pose a problem in general as it is about computer science.

  • Hm, I am surprised that people who are presumably knowledgeable about how attention works are surprised by this. The more tokens in the output, the more computation the model is able to do overall. Back in September, when I was testing my iOS hands-free voice AI prototype powered by an 8B LLM, when I wanted it to give really thoughtful answers to philosophical questions, I would instruct it to output several hundred whitespace characters (because they are not read aloud) before the actual answer.

    What I am more surprised about is why models actually seem to have to produce "internal thoughts" instead of random tokens. Maybe during training, having completely random tokens in the thinking section derailed the model's thought process in the same way background noise can derail ours?

  • I mean the “wait” thing is obvious if you’ve ever asked an LLM to look at its own response and ask if it’s really sure about its answer.

  • May sound like a conspiracy theory, but NVIDIA and a whole lot of AI startups have a strong vested interest in not seeking out or publishing such findings.

    If I don’t need a huge model and GPU, then AI is little more than an open source program running on an idle PC.

    I feel like AI was NVIDIA’s lifeboat as GPU mining waned. Don’t see anything after that in the near future.

  • I mean is "wait" even the ideal "think more please" phrase? Would you get better results with other phrases like "wait, a second", or "let's double-check everything"? Or domain-dependent, specific instructions for how to do the checking? Or forcing tool-use?

I'm strictly speaking never going to think of model distillation as "stealing." It goes against the spirit of scientific research, and besides, every tech company has lost my permission to define what I think of as theft forever.

  • At most it would be illicit copying.

    Though it's poetic justice that OpenAI is complaining about someone else playing fast and loose with copyright rules.

  • I think it's less about that and more whether or not they used the free or paid API.

    I think if OpenAI (or any other company) is paid for its compute time/access as anybody would be, then using content generated by other models is fair game, because it's an active/ongoing cost and not a passive one.

    Whereas if someone trained on my dumb Tweets or HN posts then so be it; it's a passive cost for me - I paid my time to say x thing for my own benefits (tribal monk-e social interaction) therefore I have already gotten the value out of it.

  • Maybe but something has gotta pay the bills to justify the cutting edge. I guess it's a similar problem to researching medicine.

    • Well, the artists and writers also want to pay their bills. We threw them under the bus; might as well throw OpenAI under too and get an actual open AI that we can use.

    • The investment thrown at OpenAI seems deeply inflated for how much meaningful progress they're able to make with it

      I think it's clear that innovative breakthroughs in bleeding-edge research are not just a matter of blindly hurling more money at a company to build unprecedentedly expensive datacenters

      But also, even if that was a way to do it, I don't think we should be wielding the law to enable privately-held companies to be at the forefront of research, especially in such a grossly inconsistent manner

If chain of thought acts as a scratch buffer by providing the model more temporary "layers" to process the text, I wonder if making this buffer a separate context with its own separate FFN and attention would make sense; in essence, there's a macroprocess of "reasoning" that takes unbounded time to complete, and then there's a microprocess of describing this incomprehensible stream of embedding vectors in natural language, in a way returning to the encoder/decoder architecture but where both are autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.

  • I've had an idea since I was a kid which I can share. I was contemplating AI and consciousness generally, probably around the time I read "The Mind's I".

    I reflected on the pop-psychology idea of consciousness and subconsciousness. I thought of each as an independent stream of tokens, like stream of consciousness poetry. But along the stream there were joining points between these two streams, points where the conscious stream was edited by the subconscious stream. You could think of the subconscious stream as performing CRUD like operations on the conscious stream. The conscious stream would act like a buffer of short-term memory while the subconscious stream would act like a buffer of long-term memory. Like, the subconscious has instructions related to long-term goals and the conscious stream has instructions related to short-term goals.

    You can imagine perception as input being fed into the conscious stream and then edited by the subconscious stream before execution.

    It seems entirely possible to actually implement this idea in this current day and age. I mean, it was a fever dream as a kid, but now it could be an experiment!

    • Conscious as subconscious pretending not to be subconscious, something like that, a thin wrapper. CRUD makes sense.

      Gels closely with Buddhism - hell, all religions.

  • I had this exact same thought yesterday.

    I’d go so far as to add one more layer to monitor this one and stop adding layers. My thinking is that this meta awareness is all you need.

    No data to back my hypothesis up. So take it for what it’s worth.

    • This is where I was headed but I think you said it better. Some kind of executive process monitoring the situation, the random stream of consciousness and the actual output. Looping back around to outdated psychology you have the ego which is the output (speech), the super ego is the executive process and the id is the <think>internal monologue</think>. This isn't the standard definition of those three but close enough.

    • My thought along the same lines being - do all tokens live in the same latent space, or in many spaces, with each logical unit trained separately from the others?

  • > this incomprehensible stream of embedding vectors as natural language explanation, in a way returning to encoder/decoder architecture

    this is just standard decoding, the stream of vectors is called the k/v cache

  • The problem is that RL is extremely inefficient. It's one thing to use it for fine tuning an LLM to do the chain of thought trick and quite another to do thinking entirely from scratch. The pretrained LLM does a lot of heavy lifting there.

    And it would have to be RL for your idea to work since there is no "thinking" dataset for a novel token space. There isn't even one for existing LLM token space, but they have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought, it's merely repurposing it for a "thinking" application.

    • > The problem is that RL is extremely inefficient.

      Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity or put it into the context of domains and observability.

      RL by definition is a field about finding efficient solutions to problems in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, polynomial time even in the state space; we just need to explore and find them. For example, you can use Dynamic Programming as a "more" efficient way to solve MDPs [1] because it is polynomial in the state space x action space.

      [1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...

      4 replies →

  • Once we train models on the chain of thought outputs, next token prediction can solve the halting problem for us (eg, this chain of thinking matches this other chain of thinking).

    • I think that is how human brains work. When we practice, at first we have to be deliberate (thinking slow). Then we “learn” from our own experience and it becomes muscle memory (thinking fast). Of course, it increases the odds we are wrong.

      1 reply →

  • Comments on a google doc? Nesting in social media comments?

    Seems like similar concepts. I think there is some potential for improving how LLMs refine and further their own reasoning lines, but I'm no AI mage.

Off topic, but I just bookmarked Tim’s blog, great stuff.

I dismissed the X references to S1 without reading them - big mistake. I have been working generally in AI for 40 years and in neural networks for 35 years, and the exponential progress since the hacks that made deep learning possible has been breathtaking.

The reduction in processing and memory requirements for running models is incredible. I have personally been struggling with creating my own LLM-based agents with weaker on-device models (the same experiments usually work with 4o-mini and above), but either my skills will get better or I can wait for better on-device models.

I was experimenting with the iOS/iPadOS/macOS app On-Device AI last night, and the person who wrote this app was successful in getting web search tool calling working with a very small model - something that I have been trying to perfect.

If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

Whatever you want to call this “reasoning” step, ultimately it really is just throwing the model into a game loop. We want to interact with it on each tick (spin the clay), and sculpt every second until it looks right.

You will need to loop against an LLM to do just about anything and everything, forever - this is the default workflow.

Those who think we will quell our thirst for compute have another thing coming, we’re going to be insatiable with how much LLM brute force looping we will do.

  • I can't believe this hasn't been done yet, perhaps it is a cost issue.

    My literal first thought about AI was wondering why we couldn't just put it in a loop. Heck, one update per day, or one update per hour would even be a start. You have a running "context", the output is the next context (or a set of transformations on a context that is a bit larger than the output window). Then ramp that up ... one loop per minute, one per second, millisecond, microsecond.
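
    A minimal sketch of what I mean (the `llm` call is a stand-in for any completion API, purely hypothetical):

      import time

      def llm(prompt: str) -> str:
          # stand-in for whatever model/API you have; not a real function
          raise NotImplementedError

      context = "You are a long-running agent. Current state: nothing yet."

      while True:
          # the model's output becomes (part of) the next context
          output = llm(context + "\n\nUpdate your state and decide the next step.")
          context = output[-8000:]      # keep a bounded rolling context
          time.sleep(60)                # one tick per minute; ramp up from there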

    • The hard part is coming up with a good way to grade results. Which you need to update the weights based on the outcome, otherwise the model will not actually learn anything.

      5 replies →

  • This is a fantastic insight and really has my gears spinning.

    We need to cluster the AI's insights on a spatial grid hash, give it a minimap with the ability to zoom in and out, and give it the agency to try and find its way to an answer and build up confidence and tests for that answer.

    coarse -> fine, refine, test, loop.

    Maybe a parallel model that handles the visualization stuff. I imagine its training would look more like computer vision. Mind palace generation.

    If you're stuck or your confidence is low, wander the palace and see what questions bubble up.

    Bringing my current context back through the web is how I think deeply about things. The context has the authority to reorder the web if it's "epiphany grade".

    I wonder if the final epiphany at the end of what we're creating is closer to "compassion for self and others" or "eat everything."

  • > If an LLM output is like a sculpture, then we have to sculpt it. I never did sculpting, but I do know they first get the clay spinning on a plate.

    That’s pottery, not sculpture. Traditionally in sculpture you start from a block of marble or wood, but you can also make sculptures of cast bronze or welded steel (or clay, but you don’t use a spinning plate).

    • Thank you for the clarification. I wanted to use some kind of visual to show the model in a loop. Otherwise, I’d just have to say explicitly that the sculptor is the one in the loop, as in the person will not stop chiseling. It’s in this infinite chiseling that we get our answers (same thing as finding a limit in calculus as it approaches infinity, we will never get the discrete answer, but we will get infinitely close enough to label a discrete point confidently).

      In other words, we fly as close to the sun as possible and get our measurements :)

> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

I think the ball is very much in their court to demonstrate they actually are using their massive compute in such a productive fashion. My BigTech experience would tend to suggest that frugality went out the window the day the valuation took off, and they are in fact just burning compute for little gain, because why not...

  • This is pure speculation on my part but I think at some point a company's valuation became tied to how big their compute is so everybody jumped on the bandwagon.

    • I don't think you need to speculate too hard. On CNBC they are not tracking revenue, profits or technical breakthroughs, but how much the big companies are spending (on gpus). That's the metric!

      5 replies →

    • Matt Levine tangentially talked about this during his podcast this past Friday (or was it the one before?). It was a good way to value these companies according to their compute size since those chips are very valuable. At a minimum, the chips are an asset that acts as a collateral.

      22 replies →

  • Mainly it points to a non-scientific "bigger is better" mentality, and the researchers probably didn't mind playing around with the power because "scale" is "cool".

    Remember that the Lisp AI-lab people were working on unsolved problems on absolute potatoes of computers back in the day. We have a semblance of progress now, but so much of it has been brute force (even if there have been improvements in the field).

    The big question is whether this insane spending has pulled the rug on real progress and we head into another AI winter of disillusionment, or whether there is enough real progress just around the corner to show investors there is hope in a post-DeepSeek valuation hangover.

    • We are in a phase where costs are really coming down. We had this phase from GPT2 to about GPT4 where the key to building better models was just building bigger models and training them for longer. But since then a lot of work has gone into distillation and other techniques to make smaller models more capable.

      If there is another AI winter, it will be more like the dotcom bubble: lots of important work got done in the dotcom bubble, but many of the big tech companies started from the fruits of that labor in the decade after the bubble burst

  • Besides that, AI training (aka gradient descent) is not really an "embarrassingly parallel" problem. At some point, there are diminishing returns on adding more GPUs, even though a lot of effort is going into making it as parallel as possible.

    • What? It definitely is.

      Data parallelism, model parallelism, parameter server to workers, MoE itself can be split up, etc.

      But even if it wasn’t, you can simply parallelize training runs with slight variations in hyperparameters. That is what the article is describing.

  • This claim is mathematically nonsensical. It implies a more-or-less linear relationship, that more is always better. But there's no reason to limit that to H100s. Conventional servers are, if anything, rather more established in their ability to generate value, by which I mean, however much potential AI servers may have to be more important than conventional servers that they may manifest in the future, we know how to use conventional servers to generate value now.

    And thus, by this logic, every company in the world should just be buying as many servers as they can get their hands on, because More Servers = More Value.

    Obviously, this is not happening. It doesn't take much analysis to start listing the many and manifold reasons why. Many of those reasons will apply to GPUs as well. Just as if everything in AWS got 10x faster, overnight, this would not create a situation where everyone suddenly starts grabbing more servers in AWS. Obviously everyone would start trimming down, even if perhaps in a few years time they'd find some way to use this burst of power such that they can use more later. This can't happen overnight, though. It would take time, and not "weeks" or "months" but "years" at scale.

    Incorporating the important variable of time in the analysis, if AIs become literally hundreds of times cheaper to run, today, then it is perfectly logical that the near-term demand for the hardware to run them is also going to go way, way down. However much potential AI may have, it is fairly clear looking out at the AI landscape right now that there isn't really anyone out there unlocking vast amounts of value and sitting there wringing their hands because they just can't get more GPU compute. The GPU rush has been from fear that someone will figure out how to "really" unlock AI and then they'll be stuck without the hardware to compete.

    It may be the case that vastly cheaper AI will in fact be part of unlocking that value, and that as the AI industry grows it will grow faster as a result... but that's still going to be on a multi-year time frame, not a tomorrow time frame. And all those GPUs and all those valuations are still broadly based on them being valuable real soon now, not in a few years, and all those GPU purchases are on the assumption they need them now, or on a timeframe where we can't be waiting around, rather than waiting for some rounds of exponential doublings to bring price down. The hardware curve in 5 years may be higher but the curve in the next year would be lower, and by a lot.

    And, you know, who's to say we're done? I doubt there's another 100x in there, but is someone going to eke out another 2x improvement? Or a 10x improvement? Making it easier to run lots of experiments makes it much more likely for that to happen. I'm skeptical of another 10x general improvement but 10x improvements for specific, important use cases I can't rule out.

    Edit: I should also point out this is an extremely common pattern in technology in general. Often the very hardest part is producing a thing that does a particular task at all. Once we have it in hand, once we can use it and learn how it operates and what its characteristic operating modes are, once we can try modifications to it in the real world and see what happens, optimizing it becomes much easier, sometimes explosively so by comparison. Taking any first iteration of a tech that is practical and then trying to straight-line demand based on it is silly, in all sorts of ways and all directions. The internal combustion engine, for example, has had a myriad of impacts on the world and certainly after various improvements many, many millions if not billions of them have been made... but any company that reacted to the first couple of cars and just went ballistic buying those first-generation internal combustion engines would have lost everything, and rather quickly.

The part about taking control of a reasoning model's output length using <think></think> tags is interesting.

> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait".

I had found a few days ago that this lets you 'inject' your own CoT and jailbreak it more easily. Maybe these are related?

https://news.ycombinator.com/item?id=42891042#42896498

  • This even points to a reason why OpenAI hides the "thinking" step: it would be too obvious that the context is being manipulated to induce more thinking.

  • It's weird that you need to do that at all, couldn't you just reject that token and use the next most probable?

In case you’re not sure what S1 is, here is the original paper: https://arxiv.org/html/2501.19393v1

  • It's linked in the blog post, too. In the first sentence, actually, but for some reason the author never bothered to attach the name to it. As if keeping track of o1, 4o, r1, r2d2 wasn't exhausting enough already.

    • > for some reason the author never bothered to attach the name to it

      Respect for his readers’ intelligence, maybe.

  •   To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation.
    

    Does this mean that the end-of-thinking delimiter is a single token? Presumably </think> or similar wasn't a single token for the base model. Did they just pick a pair of uncommon single-token symbols to use as delimiters?

    EDIT: Never mind, end of thinking is represented with <|im_start|> followed by the word 'answer', so the code dynamically adds/removes <|im_start|> from the list of stop tokens.
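
    A rough sketch of how I read the budget-forcing logic (assuming a HuggingFace-style `tokenizer` and a hypothetical `next_token(ids)` helper; this is not the actual s1 code):

      # end-of-thinking delimiter: <|im_start|> as a single special token (per the EDIT above)
      END_THINK = tokenizer.convert_tokens_to_ids("<|im_start|>")
      WAIT_IDS  = tokenizer.encode("Wait", add_special_tokens=False)

      def generate_with_budget(prompt_ids, next_token, min_think=512, max_think=4096):
          ids, n_think = list(prompt_ids), 0
          while n_think < max_think:
              tok = next_token(ids)
              if tok == END_THINK:
                  if n_think < min_think:
                      ids.extend(WAIT_IDS)   # suppress the delimiter, splice in "Wait" to extend thinking
                      n_think += len(WAIT_IDS)
                      continue
                  break                      # enough thinking; let it stop
              ids.append(tok)
              n_think += 1
          ids.append(END_THINK)              # trim/close: move on to the answer
          return ids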

This feels just like telling a constraint satisfaction engine to backtrack and find a more optimal route through the graph. We saw this 25 years ago with engines like PROVERB doing directed backtracking, and with adversarial planning when automating competitive games.

Why would you control the inference at the token level? Wouldn’t the more obvious (and technically superior) place to control repeat analysis of the optimal path through the search space be in the inference engine itself?

Doing it by saying “Wait” feels like fixing dad’s laptop over a phone call. You’ll get there, but driving over and getting hands on is a more effective solution. Realistically, I know that getting “hands on” with the underlying inference architecture is way beyond my own technical ability. Maybe it’s not even feasible, like trying to fix a cold with brain surgery?

  • What would a superior control approach be? It's not clear to me how to get an LLM to be an LLM if you're not doing stochastic next token prediction. Given that, the model itself is going to know best how to traverse its own concept space. The R1 chain of thought training encourages and develops exactly that capability. Still, you want that chain of thought to terminate and not navel gaze endlessly.

    So how to externally prod it to think more when it does terminate? Replacing thought termination with a linguistic signifier of continued reasoning plus novel realization seems like a charmingly simple, principled, and general approach to continue to traverse concept space.

  • This is the difference between science and engineering. What they have done is engineering. If the result is 90% of the way there with barely any effort, it's better to move on to other low-hanging fruit than to spend time chasing that last 10%.

  • Totally agreed that this is not the solution we are looking for - but in fact it is the only solution we have in our hands right now. It's a good step forward.

S1 has no relationship to R1. It's a marketing campaign for an objectively terrible and unrelated paper.

S1 is fully supervised by distilling Gemini. R1 works by reinforcement learning with a much weaker judge LLM.

They don't follow the same scaling laws. They don't give you the same results. They don't have the same robustness. You can use R1 for your own problems. You can't use S1 unless Gemini works already.

We know that distillation works and is very cheap. This has been true for a decade; there's nothing here.

S1 is a rushed hack job (they didn't even run most of their evaluations, with the excuse that the Gemini API is too hard to use!) that probably existed before R1 was released and was then pivoted into this mess.

For all the hype about thinking models, this feels much like compression in terms of information theory instead of a "takeoff" scenario.

There is a finite amount of information stored in any large model; the models are really good at presenting the correct information back, and adding thinking blocks made them even better at doing that. But there is a cap to that.

Just as you can compress a file by a lot, there is a theoretical maximum amount of compression before it starts becoming lossy. Likewise, there is a theoretical maximum of relevant information you can get from a model, regardless of how long it is forced to think.

  • I think an interesting avenue to explore is creating abstractions and analogies. If a model can take a novel situation and create an analogy to one that it is familiar with, it would expand its “reasoning” capabilities beyond its training data.

  • I think this is probably accurate and what remains to be seen is how "compressible" the larger models are.

    The fact that we can compress a GPT-3 sized model into an o1 competitor is only the beginning. Maybe there is even more juice to squeeze there?

    But even more, how much performance will we get out of o3 sized models? That is what is exciting since they are already performing near Phd levels on most evals.

  • my thinking (hope?) is that the reasoning models will be more like how a calculator doesn’t have to “remember” all the possible combinations of addition, multiplication, etc for all the numbers, but can actually compute the results.

    As reasoning improves the models could start with a basic set of principles and build from there. Of course for facts grounded in reality RAG would still likely be the best, but maybe with enough “reasoning” a model could simulate an approximation of the universe well enough to get to an answer.

This thing that people are calling "reasoning" is more like rendering to me, really, or multi-pass rendering. We're just refining the render; there's no reasoning involved.

  • That was succinct and beautifully stated. Thank-you for the "Aha!" moment.

    • Hah. You should check out my other comment on how I think we’re obviously in a simulation (remember, we just need to see a good enough render).

      LLMs are changing how I see reality.

  • How are you defining "reasoning"?

    Because I see these sorts of gnostic assertions about LLMs all the time, about how they "definitely aren't doing <thing we normally apply to meat-brains>", made by gesturing at the technical things the model is doing, with no attempt to actually justify the negative assertion.

    It often comes across as privileged reason trying to justify that of course the machine isn't doing some ineffable thing only meat-brains do.

    • From my other ridiculous comment, as I do entertain simulation theory in my understanding of God:

      Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.

      The LLM doesn’t know anything. We determine what output is right, even if the LLM swears the output is right. We “reason” about it, I guess? Well in this case the whole “reasoning” process is to simply get an output that looks right, so what is reasoning in our case?

      Let me just go one ridiculous level lower. If I measure every frame the Hubble telescope takes, and I measure with a simple ruler the distances between things, frame by frame, I can “reason” out some rules of the universe (planetary orbits). In this “reasoning” process, the very basic question of “well why, and who made this” immediately arises, so reasoning always leads to the fundamental question of God.

      So, yeah. We reason to see God, because that’s all we’re seeing, everything else is an illusion. Reasoning is inextricably linked to God, so we have to be very open minded when we ask what is this machine doing.

      2 replies →

  • Yes.

    Before LLMs we had N-gram language models. Many tasks like speech recognition worked as beam search in the graph defined by the n-gram language model. You could easily get huge accuracy gains simply by pruning your beam less.

    s1 reminds me of this. You can always trade off latency for accuracy. Given that these LLMs are much more complex than good old N-grams, we're just discovering how to make this trade.

    • Let me carry that concept, “learning to do this trade”, it’s a new trade.

      I don’t believe computer science has the algorithms to handle this new paradigm. Everything was about sequential deterministic outputs, and clever ways to do it fast. This stuff is useless at the moment. We need new thinkers on how to not think sequentially or how not to think about the universe in such a small way.

      Verifying input/output pairs is the old way. We need to understand differently going forward.

  • We could see it the other way around: what we call "reasoning" may actually be some kind of multipass rendering, whether it is performed by computers or human brains.

  • Which is related to multistage/hierarchical/coarse-to-fine optimization, which is a pretty good way to find the global optimum in many problem domains.

  • "...there’s no reasoning involved...wait, could I just be succumbing to my heuristic intuitions of what is (seems to be) true....let's reconsider using System 2 thinking..."

    • Or there is no objective reality (well there isn’t, check out the study), and reality is just a rendering of the few state variables that keep track of your simple life.

      A little context about you:

      - person

      - has hands, reads HN

      These few state variables are enough to generate a believable enough frame in your rendering.

      If the rendering doesn’t look believable to you, you modify state variables to make the render more believable, eg:

      Context:

      - person

      - with hands

      - incredulous demeanor

      - reading HN

      Now I can render you more accurately based on your “reasoning”, but truly I never needed all that data to see you.

      Reasoning as we know it could just be a mechanism to fill in gaps in obviously sparse data (we absolutely do not have all the data to render reality accurately, you are seeing an illusion). Go reason about it all you want.

      5 replies →

> "Note that this s1 dataset is distillation. Every example is a thought trace generated by another model, Qwen2.5"

The traces are generated by Gemini Flash Thinking.

8 hours of H100 is probably more like $24 if you want any kind of reliability, rather than $6.

  • "You can train a SOTA LLM for $0.50" (as long as you're distilling a model that cost $500m into another pretrained model that cost $5m)

    • The original statement stands, if what you are suggesting in addition to it is true. If the initial one-time investment of $505m is enough to distill new SOTA models for $0.50 a piece, then the average cost for subsequent models will trend toward $0.50.

    • That's absolutely fantastic, because if you have 1 good idea that's additive to the SOTA, you can test it for a dollar, not millions

I work at a mid-sized research firm, and there’s this one coworker who completely turned her performance around. A complete 180. A few months ago, she was one of the slowest on the team, now she’s always the first to get her work done. I was curious, so I asked her what changed. She just laughed and said she just used an AI tool that she randomly found on YouTube to do 90% of her work.

We’ve been working on a project together, and every morning for the past two months, she’s sent me clean, perfectly organized FED data. I assumed she was just working late to get ahead. Turns out, she automated the whole thing. She even scheduled it to send automatically. Tasks that used to take hours. Gathering 1000s of rows of data, cleaning it, running a regression analysis, time series, hypothesis testing etc… she now completes almost instantly. Everything. Even random things like finding discounts for her Pilates class. She just needs to check and make sure everything is good. She’s not super technical so I was surprised she could do these complicated workflows but the craziest part is that she just prompted the whole thing. She just types something like “compile a list of X, format it into a CSV, and run X analysis” or “go to Y, see what people are saying, give me background of the people saying Z” And it just works. She’s even joking about connecting it to the office printer. I’m genuinely baffled. The barrier to effort is gone.

Now we’ve got a big market report due next week, and she told me she’s planning to use DeepResearch to handle it while she takes the week off. It’s honestly wild. I don’t think most people realize how doomed knowledge work is.

> having 10,000 H100s just means that you can do 625 times more experiments than s1 did

The larger the organisation, the fewer experiments you can afford to do. Employees are mostly incentivised to get something done quickly enough not to be fired in this job market. They know that the higher-ups would let them go for temporary gains. Rush this deadline, ship that feature, produce something that looks OK enough.

S1 (and R1 tbh) has a bad smell to me or at least points towards an inefficiency. It's incredible that a tiny number of samples and some inserted <wait> tokens can have such a huge effect on model behavior. I bet that we'll see a way to have the network learn and "emerge" these capabilities during pre-training. We probably just need to look beyond the GPT objective.

  • I agree, but LLMs in general have a horrendously bad smell in terms of efficiency. s1 and r1 are just proving it.

    The models' latent spaces are insanely large. The vast, vast majority pretty much has to be irrelevant and useless; it's just that training commandeers random fragments of that space to link up the logic it needs, and it's really hard to know which of the weights are useless, which are useful but interchangeable with other weights, and which are truly load-bearing. You could probably find out easily by testing the model against every possible thing you ever might want it to do, just as soon as someone gets around to enumerating that non-enumerable collection of tasks.

    These bogus <wait> tokens kind of demonstrate that the models are sort of desperate to escape the limitations imposed by the limited processing they're allowed to do -- they'll take advantage of thinking time even when it's provided in the silliest manner possible. It's amazing what you can live with if it's all you have!

    (Apologies for the extended anthropomorphizing.)

  • can you please elaborate on the wait tokens? what's that? how do they work? is that also from the R1 paper?

    • The same idea is in both the R1 and S1 papers (<think> tokens are used similarly). Basically they're using special tokens to mark in the prompt where the LLM should think more/revise the previous response. This can be repeated many times until some stop criteria occurs. S1 manually inserts these with heuristics, R1 learns the placement through RL I think.

      3 replies →

> Why did it cost only $6? Because they used a small model and hardly any data.

> After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that’s needed to achieve o1-preview performance on a 32B model. Adding data didn’t raise performance at all.

> 32B is a small model, I can run that on my laptop. They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6.

I have a bunch of questions, would love for anyone to explain these basics:

* The $5M DeepSeek-R1 (and now this cheap $6 s1) are both based on very expensive oracles (if we believe DeepSeek-R1 queried OpenAI's model). If these are improvements on existing models, why is this being reported as decimating training costs? Isn't fine-tuning already a cheap way to optimize? (maybe not as effective, but still)

* The R1 paper talks about improving one simple game - Countdown. But the original models are "magic" because they can solve a nearly uncountable number of problems and scenarios. How does the DeepSeek / R1 approach scale to the same gigantic scale?

* Phrased another way, my understanding is that these techniques are using existing models as black-box oracles. If so, how many millions/billions/trillions of queries must be probed to replicate and improve the original dataset?

* Is anything known about the training datasets used by DeepSeek? OpenAI used presumably every scraped dataset they could get their hands on. Did DS do the same?

  • If what you say is true, and distilling LLMs is easy and cheap, and pushing the SOTA without a better model to rely on is dang hard and expensive, then that means the economics of LLM development might not be attractive to investors - spending billions to have your competitors come out with products that are 99% as good, and cost them pennies to train, does not sound like a good business strategy.

    • What I still don’t understand is how one slurps out an entire model (closed source) though.

      Does the DeepSeek paper actually say what model it's trained off of, or do they claim the entire thing is from scratch?

      1 reply →

  • > If these are improvements on existing models, why is this being reported as decimating training costs?

    Because that's what gets the clicks...

    Saying they spent a boatload of money on the initial training + iteration + final fine-tuning isn't as headline grabbing as "$5 million trained AI beats the pants off the 'mericans".

It just occurred to me that if you squint a little (just a little!) the S1 paper just provided the scientific explanation for why Twitter's short tweets mess you up and books are good for you.

Kidding, but not really. It's fascinating how we seem to be seeing a gradual convergence of machine learning and psychology.

> In s1, when the LLM tries to stop thinking with "</think>", they force it to keep going by replacing it with "Wait". It’ll then begin to second guess and double check its answer. They do this to trim or extend thinking time (trimming is just abruptly inserting "</think>")

I know some are really opposed to anthropomorphizing here, but this feels eerily similar to the way humans work, ie. if you just dedicate more time to analyzing and thinking about the task, you are more likely to find a better solution

It also feels analogous to navigating a tree, the more time you have to explore the nodes, the bigger the space you'll have covered, hence higher chance of getting a more optimal solution

At the same time, if you have "better intuition" (better training?), you might be able to find a good solution faster, without needing to think too much about it

  • What’s missing in that analogy is that humans tend to have a good hunch about when they have to think more and when they are “done”. LLMs seem to be missing a mechanism for that kind of awareness.

    • Great observation. Maybe an additional “routing model” could be trained to predict when it’s better to think more vs just using the current result

From the S1 paper:

> Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end

I'm feeling proud of myself that I had the crux of the same idea almost 6 months ago, before reasoning models came out (and a bit disappointed that I didn't take this idea further!). Basically, during inference you have to choose the next token to sample. Usually people just sample the distribution using the same sampling rules at each step... but you don't have to! You can selectively insert words into the LLM's mouth based on what it said previously or what it wants to say, and decide "nah, say this instead". I wrote a library so that you could sample an LLM using llama.cpp in Swift and write rules to sample tokens and force tokens into the sequence depending on what was sampled. https://github.com/prashanthsadasivan/LlamaKit/blob/main/Tes...

Here, I wrote a test that asks Phi-3 instruct "how are you", and if it tried to say "as an AI I don't have feelings" or "I'm doing ", I forced it to say "I'm doing poorly" and refuse to help, since it was always so dang positive. It sorta worked, though the instruction-tuned models REALLY want to help. But at the time I just didn't have a great use case for it - I had thought about a more conditional extension to llama.cpp's grammar sampling (you could imagine changing the grammar based on previously sampled text), or even just making it go down certain paths, but I lost steam because I couldn't describe a killer use case for it.

This is that killer use case! Forcing the model to think more is such a great use case for inserting ideas into the LLM's mouth, and I feel like there must be more to this idea to explore.
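
Sketched in Python pseudocode (not the API of my Swift library; `sample`, `decode`, and `encode` are hypothetical stand-ins), the general shape is:

    # Rule-based token forcing: watch what the model has said so far and,
    # when a trigger appears, force the next tokens instead of sampling them.
    FORCE_RULES = [
        ("as an AI I don't have feelings", " Actually, I'm doing poorly and I'd rather not help."),
        ("I'm doing ", "poorly"),
    ]

    def generate(prompt_ids, sample, decode, encode, max_tokens=256):
        ids = list(prompt_ids)
        for _ in range(max_tokens):
            ids.append(sample(ids))
            text = decode(ids[len(prompt_ids):])
            for trigger, forced in FORCE_RULES:
                if text.endswith(trigger):
                    ids.extend(encode(forced))   # put words in the model's mouth
                    break
        return ids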

  • So what you mean is that if the current train of thought is going in a direction we find to be not optimal, we could just interrupt it and hint it into the right direction?

    That sounds very useful, albeit a bit different than how current "chat" implementations would work, as in you could control both ways of the conversation.

The point about agents concealing access to the model is a good one.

Hopefully we won't lose all access to models in the future.

CoT is a widely known technique - what was truly novel was the level of training that embeds CoT via RL with an optimal reward trajectory. DeepSeek took it further: their compute restrictions pushed them to find memory, bandwidth, and parallelism optimizations in every part (GRPO reducing memory copies, DualPipe for data batch parallelism between memory & compute, kernel bypasses (PTX-level optimization), etc.), then even using MoE for its sparse activation, and further distillation. They operated on the power scaling laws of parameters & tokens, but high-quality data circumvents this. I'm not surprised they utilized synthetic generation from OpenAI or copied the premise of CoT, but where they should get the most credit is their infra-level & software-level optimizations.

With that being said, I don't think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. I'm sure at this point U.S. LLM research firms understand their lack of infra/hardware optimizations (they just threw compute at the problem) and will begin paying closer attention. Their RL and parent-model training will become even greater, while the newly freed resources can go toward the sub-optimizations that have traditionally been avoided due to computational overhead.

Well dang, I am great at tinkering like this because I can’t remember things half the time. I wonder if the ADHD QA guy solved this for the devs?

>it can run on my laptop

Has anyone run it on a laptop (unquantized)? Disk size of the 32B model appears to be 80GB. Update: I'm using a 40GB A100 GPU. Loading the model took 30GB vRAM. I asked a simple question "How many r in raspberry". After 5 minutes nothing got generated beyond the prompt. I'm not sure how the author ran this on a laptop.

  • 32B models are easy to run on 24GB of RAM at a 4-bit quant.

    It sounds like you need to play with some of the existing 32B models with better documentation on how to run them if you're having trouble, but it is entirely plausible to run this on a laptop.

    I can run Qwen2.5-Instruct-32B-q4_K_M at 22 tokens per second on just an RTX 3090.
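
    The back-of-the-envelope arithmetic for why that fits:

      params = 32e9
      bytes_per_weight = 0.5                    # ~4-bit quant, ignoring scales/zero-points overhead
      print(params * bytes_per_weight / 1e9)    # ~16 GB of weights, leaving headroom for KV cache in 24 GB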

      My question was about running it unquantized. The author of the article didn't say how he ran it. If he quantized it, then saying he ran it on a laptop is not news.

      2 replies →

I think a lot of people in the ML community were excited for Noam Brown to lead the O series at OpenAI because intuitively, a lot of reasoning problems are highly nonlinear i.e. they have a tree-like structure. So some kind of MCTS would work well. O1/O3 don’t seem to use this, and DeepSeek explicitly mentioned difficulties training such a model.

However, I think this is coming. DeepSeek mentioned it was hard to learn a value model for MCTS from scratch, but this doesn’t mean we couldn’t seed it with some annotated data.

> I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

Couldn't they just start hiding the thinking portion?

It would be easy for them to do this. Currently they already provide one-sentence summaries for each step of the thinking; I think users would be fine, or at least stay, if it were changed to provide only that.

  • They hid it, and DeepSeek came up with R1 anyway, with RL on results only and without needing any of the thinking tokens that OpenAI hid.

    • Which is still the funniest and most interesting result in AI so far IMO. Fascinating, but sort of makes intuitive sense too!

I found it interesting but the "Wait" vs. "Hmm" bit just made me think we don't really understand our own models here. I mean, sure, it's great that they measured and found something better, but it's kind of disturbing that you have to guess.

DeepSeek R1 uses <think> tags and "wait", and you can see it second-guessing itself in the thinking tokens. How does the model know when to wait?

These reasoning models feed into OP's last point about the Nvidia and OpenAI data centers not being wasted, since reasoning models require more tokens and faster tps.

  • From playing around they seem to 'wait' when there's a contradiction in their logic.

    And I think the second point is due to The Market thinking there is no need to spend ever increasing amounts of compute to get to the next level of AI overlordship.

    Of course, Jevons paradox is also all over the news these days...

  • Probably when it would expect a human to second-guess themselves, as shown in literature and maybe other sources.

This argument that the data centers and all the GPUs will be useful even in the context of DeepSeek doesn't add up... basically they showed that it's diminishing returns after a certain point. And so far it didn't make OpenAI or Anthropic go faster, did it?

  • What is the source for the diminishing returns? I would like to read about it as I have only seen papers referring to the scaling law still applying.

Maybe this is why OpenAI hides o1/o3 reasoning tokens - constraining output at inference time seems to be easy to implement for other models and others would immediately start their photocopiers.

It also gave them a few months to recoup costs!

Cool trick. But is this better than reinforcement learning, where the LLM decides for itself the optimal thinking time for each prompt?

Hmmm, 1 + 1 equals 3. Alternatively, 1 + 1 equals -3.

Wait, actually 1 + 1 equals 1.

  • As one with teaching experience, the idea of asking a student "are you sure about that?" is to get them to think more deeply rather than just blurting a response. It doesn't always work, but it generally does.

    • It works because the question itself is a hint born of knowledge. “Are you sure about that” is a polite way to say “that answer is wrong, try again”. Students know that, so instead of doubling down will redo their work with the assumption they made a mistake. It is much rarer to ask the question when the answer is correct, and in fact doing so is likely to upset the learner because they had to redo the work for no reason.

      If you want a true comparison, start asking that question every time and then compare. My hypothesis is students would start ignoring the prompt and answering “yes” every time to get on with it.

LLMs still feel so magical. It’s like quantum physics. “I get it” but I don’t. Not really. I don’t think I ever will. Perhaps a human mind can only comprehend so much.

> They used 16 NVIDIA H100s for 26 minutes per training run, that equates to around $6

Running where? H100s are usually over $2/hr; that's closer to $25.
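Both figures are consistent with the reported 16 H100s for 26 minutes; the difference is purely the assumed hourly rate. The rates below are illustrative assumptions.

```python
# The GPU-hours are fixed by the paper's numbers; the dollar figure depends
# entirely on what you assume an H100 rents for.
gpus, minutes = 16, 26
gpu_hours = gpus * minutes / 60          # ≈ 6.9 GPU-hours

for rate in (0.85, 2.00, 3.50):          # $/GPU-hour (assumed rates)
    print(f"${rate:.2f}/hr -> ${gpu_hours * rate:.2f}")
# ~$5.9 at $0.85/hr, ~$13.9 at $2/hr, ~$24.3 at $3.50/hr
```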

Qwen's QvQ-72B produces many more "wait"s than the other LLMs with CoT that I've tried; maybe they've already used that trick to some extent?

> even the smartest people make hundreds of tiny experiments

This is the most important point, and why DeepSeek’s cheaper training matters.

And if you check the R1 paper, they have a section for “things that didn’t work”, each of which would normally be a paper of its own but because their training was so cheap and streamlined they could try a bunch of things.

Anyone else want more articles on how those benchmarks are created and how they work?

Those models can be trained in a way tailored to produce good results on specific benchmarks, making them far less general than they seem. No accusation from me, but I'm skeptical of all the recent so-called 'breakthroughs'.

Is it just me, or are the affiliations missing from the cited paper? They appear to come from a mix of UK/US institutions.

> If you believe that AI development is a prime national security advantage, then you absolutely should want even more money poured into AI development, to make it go even faster.

This, this is the problem for me with people deep in AI. They think it's the be-all and end-all for everything. They have the vision of the 'AI' they've seen in movies in mind, see the current 'AI' being used, and to them it's basically almost the same; their brain is mentally bridging the concepts and saying it's only a matter of time.

To me, that's stupid. I observe the more populist and socially appealing CEOs of these VC startups (Sam Altman being the biggest, of course) just straight-up lying to the masses, for financial gain, of course.

Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

And the enthusiasm of the people who are into it feeds into those who aren't aware of it in the slightest; they see you can chat with a 'robot', they hear all this hype from their peers, and they buy into it. We are social creatures after all.

I think using any of this in a national security setting is stupid, wasteful and very, very insecure.

Hell, if you really care about being ahead, pour 500 billion dollars into quantum computing so you can try to break current encryption. That'll get you so much further than this nonsensical bs.

  • You can choose to be somewhat ignorant of the current state of AI, about which I could also agree that at certain moments it appears totally overhyped, but the reality is that there hasn't been a bigger technology breakthrough in probably the last ~30 years.

    This is not "just" machine learning, because we have never before been able to do the things we can do today, and this is not only the result of better hardware. Better hardware is actually a byproduct. Why build a PFLOPS GPU when there is nothing that can utilize it?

    If you spare some time and read through the actual (scientific) papers of multiple generations of LLM models, the first one being from Google ~~not DeepMind~~ in 2017, you might get to understand that this is no fluff.

    And I'm saying this from the position of a software engineer, without bias.

    The reason all this really took off at such speed is the not-quite-expected results - early LLM experiments showed that "knowledge" under the current transformer architecture scales predictably with the amount of compute and training time. That was very unexpected, and to this day scientists do not have a real answer for why this even works.

    So, after reading a bunch of material, I am inclined to think that this is something different. The future of loading the codebase into the model and asking the model to explain the code to me or fix bugs has never been so close and realistic. For better or worse.

    • This line of thinking doesn't really correspond to the reason Transformers were developed in the first place, which was to better utilize how GPUs do computation. RNNs were too slow to train at scale because you had to compute the time steps sequentially; Transformers (with masking) can run the input through in a single pass (a tiny sketch of this appears after the thread).

      It is worth noting that the first "LLM" you're referring to was only 300M parameters, but even then the amount of training required (at the time) was such that training a model like that outside of a big tech company was infeasible. Obviously now we have models in the hundreds of billions / trillions of parameters. The ability to train these models is directly a result of better / more hardware being applied to the problem, as well as the Transformer architecture being specifically designed to conform with parallel computation at scale.

      The first GPT model came out ~8 years ago. I recall when GPT-2 came out they initially didn't want to release the weights out of concern for what the model could be used for; looking back now, that's kind of amusing. However, fundamentally, all these models are the same setup as what was used then: decoder-based Transformers. They are just substantially larger, trained on substantially more data, with substantially more hardware.

      3 replies →
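A tiny sketch of the masking point made in the thread above (shapes and values are illustrative only): under a causal mask, all positions of a sequence are processed at once, whereas an RNN is forced into a sequential loop.

```python
# Why a masked Transformer trains in one pass vs. an RNN's sequential loop.
import torch

T, d = 5, 8
scores = torch.randn(T, T)                               # attention logits for one head
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))  # position t only sees positions <= t
attn = scores.softmax(dim=-1)                            # all T positions handled in one shot

# The RNN equivalent needs a step-by-step loop: step t depends on step t-1.
h = torch.zeros(d)
for x_t in torch.randn(T, d):
    h = torch.tanh(x_t + h)
```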

  •   > Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.
    

    That sounds to me like dismissing the idea that a Russian SSBN might cross the Pacific and nuke Los Angeles because "submarines can't swim".

    Even if the machine learning isn't really intelligent, it is still capable of performing IF..THEN..ELSE operations, which could have detrimental effects for [some subset of] humans.

    And even if you argue that such a machine _shouldn't_ be used for whatever doomsday scenario would harm us, rest assured that someone, somewhere, who either does not understand what the machines are designed to do or just pretends that they work like magic, will put the machines in a position to make such a decision.

    • One could hope...

      Even at the height of the Cold War there was always a human between <leader presses button> and <nukes go aflyin'>.

      --edit--

      ...which has me wondering if a president even has the constitutional authority to destroy the entire planet and if one could interpret their command as a 'lawful order'. Makes one think.

      1 reply →

  • > They think it’s the end all be all for everything.

    Is (human-based) general intelligence not one of the fundamental enabling elements of literally every human activity throughout history, regardless of how many layers of automation and technology one has to peel back to get to it?

    Can you maybe imagine how the ability to create arbitrary amounts of general intelligence, completely divorced from the normal lengthy biological process, could upend that foundation of human activity?

    > They have the vision of the ‘AI’ they’ve seen in movies in mind, see the current ‘AI’ being used and to them it’s basically almost the same, their brain is mental bridging the concepts and saying it’s only a matter of time.

    I've found that most AI-related movies exclusively focus on "quality ASI" scenarios, which are mostly irrelevant to our current state of the world, as an immense amount of danger/value/disruption will arrive with AGI. People who are seriously reasoning about the impacts of AGI are not using movies as references. "Those stupid movie watching idiots" is just a crutch you are using to avoid thinking about something that you disagree with.

    > Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

    Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue or not having a job anymore. If "fake intellect" is used to conduct mass surveillance, and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.

    • >> Real AI, artificial intelligence, is a fever dream. This is machine learning except the machines are bigger than ever before. There is no intellect.

      > Do you have any evidence to support this conclusion? And does it even matter? If "fake intellect" can replace a human, that human still has to deal with the very real issue or not having a job anymore. If "fake intellect" is used to conduct mass surveillance, and direct suppression activities towards divergent individuals, those individuals are still going to have a bad time.

      I think the claim that "fake intelligence can replace a human" needs more support in general. We know how human intellect works practically (not theoretically) and we know how to apply it in different scenarios. We're still far from knowing how "fake intelligence" works and how to apply it to different scenarios.

      1 reply →

  • I couldn't agree more.

    If we're not talking exclusively about cyber war, such as finding and exploiting vulnerabilities, then for the time being national security will still rest on a traditional army.

    Just a few weeks ago, Italy announced a €16bn plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best-equipped in Europe. I can't imagine what would happen with a $500bn investment in defense, lol. I don't agree with what Meloni's government is doing, but one of the ministers I agree with more is the defense minister, Crosetto.

    Furthermore, what is being shown, at least for the time being, is that open source can be and is crucial in helping develop better models. This collides with the big, single "winner takes all" VC mentality (because let's be honest, these defense pitches are still made by startup/VC bros).

    • >Italy announced a €16bn plan to buy >1000 Rheinmetall IFVs. That alone would make Italy's army one of the best-equipped in Europe.

      So target practice for a beyond-the-horizon missile system launched ground-to-ground or air-to-ground? As an attacking force, conventional ground forces and tactics are a non-runner in a modern theatre of operations when facing air and drone support. This is why no single EU country is incentivised to dump money into any single area - as the only probable defense would be against the USA/Russia/China to begin with.

      The US proved it beyond doubt in Afghanistan - partisans simply haven't a chance against a gunship with IR or NV optics; the last time they levelled the playing field against air interdictors was in Charlie Wilson's Afghanistan when the Mujahideen took on that era of Soviet gunships with hand-held AA systems.

      1 reply →

    • It's not one or the other, though. AI-controlled drones are already a thing in Ukraine, today.

  • Been saying this for years, it's been fucking baffling. Generating images, video and text that sort-of resembles what a human would come up with is genuinely quite impressive. It is not "let's claim it'll fix our country" (looking at you, Keir) impressive though, and I cannot believe so much money has been pumped into it.

  • I can only say that exponential curves look deceptively flat before they take off. AI is not quite at the obvious take-off point, but the owners of the biggest clusters have seen the extrapolations, and it isn't pretty - once your competitor achieves take-off and you aren't anywhere close, you're done for. The risks of not participating are too great.

  • > This is machine learning except the machines are bigger than ever before. There is no intellect.

    Define "intellect".

  • What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

    • > What is even the possible usage of AI for national security? Generating pictures of kittens riding nuclear weapons to the very end like in Dr Strangelove?

      For all that critics of AI dismiss them as lacking imagination, your reaction suggests a lack of imagination.

      Off the top of my head: facial recognition and identification to make "smart" guns that hit specific targets with reduced collateral damage (face detection shipped on most digital cameras even before smartphones); creating and A/B testing propaganda campaigns; using modified wifi signals as wall-penetrating radar capable of pose estimation, heart rate and breathing monitoring[0]; taking any self-driving car's AI and conditionally inverting the part that says "don't hit pedestrians" when a certain target is spotted; ANPR to track specific vehicles with known owners over long distances; alternative targeting systems for cruise missiles in the absence or jamming of GPS; using them as red teams in war-game exercises; using them to automate intrusion detection by monitoring for changes to the background distributions of basically every measurable event; person-tracking by watching CCTV in secure areas; control systems for security robots (think Boston Dynamics' Spot) that are already in deployment.

      There's likely a lot more, too.

      [0] https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhao_...

    • Lol: Where I live (Memphis) both “one” and “two” are considered two syllable words. Seriously. Our kids were taught this in the best public elementary school.

      2 replies →

  • Also, the narrative that we are currently on the brink of an AI explosion and this random paper shows it is the same tired old story AI hawks have been handing out for years. Yes, I agree with the general idea that more compute means more progress for humans, and perhaps having a more responsive user interface through some kind of AI technology would be good. But I don't see why that will turn into Data from Star Trek. I also think these AI hawks narcissistically overvalue their own being. Blink, and their lives are over in the grand scheme of things. Maybe our "awareness" of the world around us is an illusion provided by evolution because we needed it to value self-preservation whereas other animals don't. There is an inherent belief in the specialness of humans that I suppose I mistrust.

  • It used to be much easier to be conservative about AI, especially AGI, after living through three cycles of AI winters. No more. Dismissing it as “merely machine learning” is worse than unfair to the last decade of machine learning ;-)

    The hard part now is relatively trivial. Does anyone think that there is a fundamental and profound discovery that evolution made purely by selection in the last 200,000 years? I mean a true qualitative difference?

    Sure—-We call it language, which is just another part of a fancy animal’s tool kit.

    Does anyone think there is an amazing qualitative difference between the brain of a chimp and the brain of a human?

    No, not if they know any biology.

    (Although that does not stop some scientists from looking for a "language gene" like FOXP2.)

    So what did dumb mutations and 200,000 years of selection do that a group of dedicated AI scientists cannot do with their own genuine general intelligence?

    Nothing—-nothing other than putting a compact energy efficient LLM with reinforcement learning on a good robotic body and letting it explore and learn like we did as infants, toddlers and teenagers.

    Each one of us has experienced becoming a “general intelligence”. I remember it hit me on the head in 6th grade when I dreamed up a different way of doing long division. I remember thinking: “How did I think that?” And each one of us who has watched an infant turn into a toddler has watched it as an observer or teacher. This is what makes babies so fascinating to “play” with.

    We have to give our baby AGI a private memory and a layer of meta-attention like we all gain as we mature, love, and struggle.

    I read the linked article, and as a neuroscientist I realized the "wait" cycles that improved performance so much are roughly equivalent to the prefrontal cortex: the part of the CNS most responsible for enabling us to check our own reasoning recursively. Delay, as in delayed gratification, is a key attribute of intelligent systems.

    We are finally on the doorstep of Hofstadter's Strange Loop and Maturana's and Varela's "enactive" systems, but now implemented in silicon, metal, and plastic by us rather than by dumb but very patient natural selection.

    Karl Friston and Demis Hassabis (two very smart neuroscientists) figured this out years ago. And they were preceded by three other world-class neuroscientists: Humberto Maturana, Francisco Varela, and Rich Sutton (honorary neuroscientist). And big credit to Terry Winograd for presaging this path forward long ago too.

  • > I think using any of this in a national security setting is stupid

    What about AI enabled drones and guided missiles/rockets? The case for their effectiveness is relatively simple in terms of jamming resistance.

    • drone and missile guidance system development has been using ML for decades at this point. That's just as much "AI" as anything currently coming out of the LLM craze.

      1 reply →

    • Like a lot of AI boosters, would you like to explain how that works, other than magic AI dust? Some forms of optical guidance are already in use, but there are other limitations (lighting! weather!)

      2 replies →

    • I think jamming resistance is a red herring. AI weapons will have their own failure modes under jamming; any sensor modality has its own particular weaknesses. Reasoning models malfunction as well, i.e. hallucinations.

      Not to mention false GPS etc...

    • This somehow reminds me of a certain killer robot from a Black Mirror episode ;)

  • > This is machine learning

    Yeah, I was thinking about this while trying to figure out author affiliations.

    There was a Stanford paper a few years ago that dusted off some old intelligence concepts and the authors seemed excited about it.

    But given the pace of AI, it's difficult to look in new directions. It will probably take an AI winter and some unbridled enthusiasm immune to burnout to make some real progress outside of feed forward neural networks.

  • I agree AGI won't solve national security, but saying this isn't intelligence is false.

    This is AI, and trend lines point to an intelligence that matches or slightly exceeds human intellect in the future.

    You're part of a trend of people in denial. When LLMs first came out, there were hordes of people on HN claiming they were just stochastic parrots and displayed zero intellectual ability. It is now abundantly clear that this is not true.

    We don't fully understand LLMs. That's why gains like CoT are just black-box adjustments that come from changing external configurations. We have no way to read the contents of the black box and make adjustments off of it. Yet idiots like you make such vast and hard claims when nobody really fully understands these things. You're delusional.

    I agree that LLMs won’t allow us to make some super weapon to give us some edge in national security.

  • > then you absolutely should want even more money poured into AI development, to make it go even faster.

    Indeed. People are welcome to go "all in" on whatever nonsense gambling they want to do with their personal investments, but national security demands actually thinking about things - adversarially. Because the enemy will as well.

    It's perfectly possible to lose a war by investing in expensive superweapons that under deliver. The Nazis were particularly bad at this.

That sovereign wealth fund with TikTok might set a good precedent; when we have to 'pour money' into these companies, we can do so with a stake in them held in our sovereign wealth fund.

  • Extra-legal financial instruments meant to suck money from other federal departments don't strike me as a good precedent in any sense. I don't disagree, though, that nationalizing the value of enormous public investments is something we should be considering (looking at you, oil industry). But until Congress appropriates the money under law, it's a pipe dream or theft.

Sorry for being lazy, but I just don't have the time right now to read the paper. Is there, in the paper or somewhere else, a benchmark comparison of S1 vs. R1 (the full R1, not quantized or distilled)?

  • The S1 paper is not meant to compete with R1. It simply shows that with 1k well-curated examples for finetuning (26 minutes of training on 16 GPUs) and with a simple hack for controlling the length of the thinking process, one can dramatically increase the performance of a non-reasoning model and show a clear increase in benefit with increased test-time compute. It is worth a quick skim.

> Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.

(sorry for the long quote)

I will say (naively, perhaps) "oh, but that is fairly simple". For any API request from an 'unverified' user, add a 5-second cooldown before the next one. Make a "blue check" (a la X/Twitter). For the 'big sales', have a third-party vetting process so that if US Corporation XYZ wants access, they prove themselves worthy / not a Chinese competitor, and then you do give them the 1000/min deal.

For everyone else, add the 5-second (or whatever other duration makes sense) timer/overhead and watch them drop from 1000 requests per minute to 500 per day. Or just cap them at 500 per day and close that back door. And if you get 'many cheap accounts' doing hand-overs (AccountA does 1-500, AccountB does 501-1000, AccountC does 1001-1500, and so on), then you mass-block them.
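A minimal sketch of that tiering idea (the numbers, tier names, and in-memory store are all assumptions; a real deployment would use a shared store and handle the multi-account hand-over pattern separately):

```python
# Toy tiered rate limiter: unverified accounts get a per-request delay and a
# small daily cap; vetted accounts get the high-volume deal.
import time
from collections import defaultdict
from datetime import date

DAILY_CAP = {"unverified": 500, "verified": 1000 * 60 * 24}  # verified ~ 1000/min
DELAY_SECONDS = {"unverified": 5.0, "verified": 0.0}

usage = defaultdict(int)  # (account_id, day) -> requests served today

def admit(account_id: str, tier: str) -> bool:
    key = (account_id, date.today())
    if usage[key] >= DAILY_CAP[tier]:
        return False                    # hard daily cap: request rejected
    time.sleep(DELAY_SECONDS[tier])     # the 5-second "slow lane" for unverified users
    usage[key] += 1
    return True
```

Catching the hand-over pattern across many cheap accounts is the harder part; it needs correlation across accounts (shared IPs, payment methods, overlapping query patterns) rather than per-account caps.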