Comment by ActivePattern

1 year ago

This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on the fly.

O1 found it.

https://chatgpt.com/share/67782b6b-6248-8012-882d-238b600ef9...

  • Amazing.

    Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.

  • Did it find it before the HN comment? O1 has access to the web, so I'm just asking.

    • No it doesn't.

      "When did Jeff Baena die?"

      > There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.

    • That's actually a good point. Sadly "open"ai obfuscates this, too, so it is impossible to know.

I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".

However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
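For concreteness, here's a quick sketch (assuming minimaxir's original 1..100,000 range) that brute-forces the extremes and recovers that ~4.5e-5 figure, which is just e^-10: the chance that one specific value is absent from a million uniform draws over 100,000 values:

```python
import math

def digit_sum(n):
    return sum(int(d) for d in str(n))

# Brute-force the extremes once, offline.
candidates = [n for n in range(1, 100_001) if digit_sum(n) == 30]
lo, hi = candidates[0], candidates[-1]
print(lo, hi, hi - lo)  # 3999 99930 95931

# Chance that one specific value never appears in 1M uniform
# draws over 100,000 values: (1 - 1/100000)^1000000 ~ e^-10.
p_miss = (1 - 1 / 100_000) ** 1_000_000
print(f"{p_miss:.2e}")  # ~4.54e-05
```

Missing either extreme (so the hardcoded answer is wrong) happens roughly twice that often, still under one run in ten thousand.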

  • The creative step you supplied to o1 was the key step and contained almost all of the difficulty. The hope is that 2025 models could eventually come up with solutions like this given enough time, but this is also a toy problem. The question is how much clever answers will cost for real-world, complex problems. At present, the answer looks like: very much.

  • This gets to the old saw, "knowing what question to ask is the most important thing". To the extent that LLMs can answer questions better than formulate which ones to ask, they may be inherently limited. We will see.

    • But it does seem they are good (to the extent that they are good at anything) at identifying the questions first if you ask them. It does mean you need an OK-enough meta-question to start the chain of reasoning, but that is the key insight of the recent wave of "reasoning models": first ask the LLM to reformulate the problem and structure an approach (or multiple approaches) to address it, then have a second pass do just that.

    • Google search with fewer steps? Still a huge advancement, of course.

      I wonder how much benefit we'd get from a meta-language for describing these problems precisely enough for LLMs to process into code; an even-higher-level language, perhaps we could call it English?

Excellent point. The hope is that reasoning LLMs will make a difference for such problems. But it's also a great example of why those who think that letting the LLM iterate more will be crucial to reasoning are off base. There are many computations that transformers (or humans, for that matter) are not well equipped to represent internally; tool use during the reasoning process is unavoidable for all but artificial or knowledge-heavy problems.

Small examples, throwaway but involved calculations, prototypes, and notes on what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.

You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:

   Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
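For reference, a plausible baseline for what that prompt yields: split the list across workers, have each report its chunk's digit-sum-30 extremes, then reduce. This is a sketch, not the precomputed-extremes trick; the worker count and chunking strategy are my choices.

```python
import random
from multiprocessing import Pool

def digit_sum(n):
    return sum(int(d) for d in str(n))

def chunk_extremes(chunk):
    """Min/max of the chunk's digit-sum-30 members, or None if absent."""
    hits = [n for n in chunk if digit_sum(n) == 30]
    return (min(hits), max(hits)) if hits else None

if __name__ == "__main__":
    nums = [random.randint(1, 10_000_000) for _ in range(1_000_000)]
    with Pool(8) as pool:
        parts = pool.map(chunk_extremes, [nums[i::8] for i in range(8)])
    parts = [p for p in parts if p is not None]
    print(max(p[1] for p in parts) - min(p[0] for p in parts))
```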

  • Whose digits sum up to 30, or the sum of whose digits equals 30?

    Btw, _whose_ digits are we talking about?

    I just built a random program generator. After I finish optimizing, I'm gonna test it to see if it works!

    "If builders built houses the way programmers build programs, the first woodpecker to come along would destroy civilization"

    https://en.m.wikiquote.org/wiki/Gerald_Weinberg

    • > Btw, _whose_ digits are we talking about?

      You seem to be under the impression that whose is not a form of which, which is incorrect.

      whose:which::whose:who


  • But what's interesting about this is that there's a tradeoff in the total computation performed by the "fully parallelized" version of this and a sequential one. Without the user knowing this, it's kind of impossible to get the optimization you want: Do you want a minimum work solution or a minimum wall-clock-time solution?

    If you want a better fully parallelized one, you do this:

    Repeat a few times in exponential progression on k:

    Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.

    In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.

    k *= 10 and repeat until done.

    I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).
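A serial stand-in for the staged scheme above (each round's scan and filter would be the parallel part; the function name and starting k are my choices):

```python
def digit_sum(n):
    return sum(int(d) for d in str(n))

def staged_extremes(nums, k=1000):
    """Scan a prefix of size k, then prune the rest against the
    extremes found so far; grow k tenfold each round."""
    lo = hi = None
    remaining = list(nums)
    while remaining:
        head, remaining = remaining[:k], remaining[k:]
        for n in head:
            if digit_sum(n) == 30:
                lo = n if lo is None else min(lo, n)
                hi = n if hi is None else max(hi, n)
        if lo is not None:
            # Values between the current extremes can't improve either one.
            remaining = [n for n in remaining if n < lo or n > hi]
        k *= 10
    return lo, hi
```

The pruning pass is the work-vs-latency tradeoff in miniature: it shrinks later rounds' input, but only if the early prefix happened to contain candidates.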

> This is actually a great example of an optimization that would be extremely difficult for an LLM to find

It'll be somewhat more likely since the next gen training set includes your comment :)

(disclaimer: I have no personal knowledge of ai companies scraping hacker news, but it wouldn't surprise me at all)

  • It would be very surprising if they did not scrape this site. The content is high-quality in the general case and there are no big barriers to entry (there's even a clean API!). One might even use us to fine-tune a coding assistant or the like.

Are you sure it would be hard?

Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.

I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.

  • I suspect that it would be unlikely to come up with it because it requires execution of a fairly lengthy algorithm (or sophisticated mathematical reasoning) to find the smallest/largest valid numbers in the range. You can verify this for yourself with the following ChatGPT prompt: "What is the smallest number in the range (1, 100000) whose digits sum to 30? Do not execute separate code."
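For what it's worth, there is a simple closed form once you see it: use the fewest digits possible, ceil(s/9) of them, pack 9s at the end, and put the remainder in the leading digit. A sketch (the helper name is mine):

```python
def smallest_with_digit_sum(s):
    # d = ceil(s/9) digits needed; leading digit s - 9*(d-1), rest 9s.
    d = -(-s // 9)
    return int(str(s - 9 * (d - 1)) + "9" * (d - 1))

print(smallest_with_digit_sum(30))  # 3999
```

The hard part for the model is exactly this kind of side derivation: it has to either run the search or spot the construction, neither of which falls out of next-token prediction on the original prompt.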