Comment by dgacmu
1 year ago
I'm amused that neither the LLM nor the author identified one of the simplest and most effective optimizations for this code: Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post, with no numba and only one line of numpy. Using numba slows things down unless you do the extra work of calculating all of the digit sums in advance (which is mostly wasted).
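Roughly, the idea looks like this (my own sketch, not the commenter's actual code; it assumes the setup from the post: a million random ints in [1, 100000] and a target digit sum of 30):

    import numpy as np

    def digit_sum(n: int) -> int:
        # mod/div digit sum
        s = 0
        while n:
            s += n % 10
            n //= 10
        return s

    nums = np.random.randint(1, 100_001, size=1_000_000)

    lo = hi = None
    for v in nums:
        v = int(v)
        # Numbers inside the current [lo, hi] bounds can't change the answer,
        # so skip the digit sum for them; that's the bulk of the input.
        if lo is not None and lo <= v <= hi:
            continue
        if digit_sum(v) == 30:
            lo = v if lo is None else min(lo, v)
            hi = v if hi is None else max(hi, v)

    result = hi - lo if lo is not None else None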
The LLM appears to be worse at identifying the big-O improvements than other kinds of fixes, which is pretty consistent with my experience using LLMs to write code.
There's another, arguably even simpler, optimization that makes me smile. (Because it's silly and arises only from the oddity of the task, and because it's such a huge performance gain.)
You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).
Timings on my machine:
Naive implementation (mod+div for digit-sums): 1.6s. Computing digit-sum only when out of range: 0.12s. Checking for the usual case first: 0.0004s.
The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.
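A rough sketch of that common-case check (my own illustration, not the commenter's code; it assumes the post's setup of 10^6 draws from [1, 100000]):

    import numpy as np

    def max_min_diff(nums: np.ndarray):
        # Usual case: with a million draws, both endpoints are almost
        # certainly present, so the answer is fixed.
        if (nums == 3999).any() and (nums == 99930).any():
            return 99930 - 3999  # = 95931
        # Rare fallback (probability ~1e-4): do it the straightforward way.
        hits = [int(v) for v in nums if sum(map(int, str(int(v)))) == 30]
        return max(hits) - min(hits) if hits else None

    nums = np.random.randint(1, 100_001, size=1_000_000)
    print(max_min_diff(nums))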
I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. (Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? Obviously in that case my hack would be completely useless.)
This is actually a great example of an optimization that would be extremely difficult for an LLM to find. It requires a separate computation to find the smallest/largest numbers in the range with digits summing to 30. Hence, an LLM is unlikely to be able to generate them accurately on the fly.
O1 found it.
https://chatgpt.com/share/67782b6b-6248-8012-882d-238b600ef9...
16 replies →
I tried it in OpenAI's O1. If I give it minimaxir's original prompt it writes the obvious loop, even if I include the postamble "Look for tricks that will make this function run as fast as possible in the common case".
However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
13 replies →
Excellent point. The hope is that reasoning LLMs will make a difference for such problems. But it's also a great example of why those who think simply letting the LLM iterate more will be crucial to reasoning are off base. There are many computations that transformers (or humans, for that matter) are not well equipped to represent internally; tool use during the reasoning process is unavoidable for all but artificial or knowledge-heavy problems.
Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.
You guys are picking on the problem statement. Here's a revised prompt, which also skips the silliness of single threading:
8 replies →
> This is actually a great example of an optimization that would be extremely difficult for an LLM to find
It'll be somewhat more likely since the next gen training set includes your comment :)
(disclaimer: I have no personal knowledge of AI companies scraping Hacker News, but it wouldn't surprise me at all)
1 reply →
Are you sure it would be hard?
Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.
I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.
8 replies →
This gave me an idea: we can skip the whole pass over the million draws by noting that the count of draws landing in my precomputed set M (digit sum = 30) is Binomial(n = 10^6, p = |M|/10^5). Then we sample that count X. If X = 0, the difference is not defined. Otherwise, we can directly draw (min, max) from the correct joint distribution of indices (like you'd get if you actually did X draws in M). Finally we return M[max] - M[min]. It's O(1) at runtime (ignoring the offline step of listing all numbers whose digits sum to 30).
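Something like the following captures the idea (a sketch under the post's assumptions; note that resampling the X hits is O(X) rather than truly O(1), whereas the fully O(1) version would draw (min, max) directly from their joint law):

    import numpy as np

    # Offline step: all values in [1, 100000] whose digits sum to 30 (the set M).
    M = np.array([n for n in range(1, 100_001) if sum(map(int, str(n))) == 30])

    rng = np.random.default_rng()
    x = rng.binomial(1_000_000, len(M) / 100_000)  # how many draws land in M
    if x == 0:
        result = None  # no qualifying number was drawn
    else:
        # Conditional on x hits, those hits are i.i.d. uniform over M, so
        # resampling x values from M reproduces the distribution of (min, max)
        # without ever generating the million draws.
        hits = rng.choice(M, size=x)
        result = int(hits.max() - hits.min())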
In fact, we could simply check for the 3 smallest and the 3 highest numbers and ignore the rest.
Assuming the numbers are really random, that's a failure probability of about 10^-13. That's at the point where we start worrying about errors caused by cosmic rays. With a few more numbers, you get to the point where the only way it can fail is if there is a problem with the random number generation or an external factor.
If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.
For 10^5 draws, to get the same miss probability (~2 * exp(-10)), you would just need to compute the 10 maximum/minimum candidates and check against those.
With this trick you can test while generating the random numbers and if you see both values, you can short circuit the generation of random numbers.
The input generation is outside the scope of this. Otherwise you could directly choose the output values with the apropriate distribution and just skip all the rest.
(Arguably, this criticism applies to exchanging random.randint for a numpy equivalent as well, since that doesn't optimize the solution but only how quickly the question is being generated.)
5 replies →
No, you're right, I should have said 550ms and 100ms, I'm having a doof morning about timing. Thank you! Too late to edit my post.
Ahhhh so we can just return the result without doing anything!
This exactly highlights my fear of widespread use of LLMs for code - missing the actual optimisations because we’re stuck in a review, rather than create, mode of thinking.
But maybe that’s a good thing for those of us not dependent on LLMs :)
Well, if you or anyone else here has good optimization and performance chops: http://openlibrary.org/ has been struggling with performance a bit lately, and it's hard to track down the cause. CPU load is low and not much has changed recently, so it's unlikely to be a bad query or something.
The main thing I've suggested is upgrading the DB from Postgres 9, which isn't an easy task, but ~15 years of DB improvements would probably give some extra performance.
It might not be as awful as feared? That big a jump probably requires a dump and restore, but maybe it could still be done in place. pg_upgrade is pretty solid. But I agree - it's likely a cheap and easy perf win.
Is there source code / a GitHub link with more information?
3 replies →
Another speed-up is to skip the sum-of-digits check if n % 9 != 30 % 9. The sum of a number's digits has the same remainder mod 9 as the number itself. This rules out 8/9, or roughly 89%, of candidates.
Would someone write a mathematical proof showing this is always true?
Comes from the fact that any number n with decimal digits a_i can be written as n = sum(a_i * 10^i).
Now when you're operating mod 9, 10 == 1 % 9, thus 10^i == 1 % 9 for every i.
Now using n = sum(a_i * 10^i) == sum(a_i) (mod 9),
we get that sum(a) and n are the same mod 9.
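A one-line brute-force check of the claim over the whole range (illustrative only):

    assert all(sum(map(int, str(n))) % 9 == n % 9 for n in range(1, 100_001))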
1 reply →
Did you measure it? I would expect using % would ruin your performance as it's slow, even if it allows you to avoid doing a bunch of sums (which are fast).
You can do this “without” using the modulus operation by storing the numbers in a boolean array. Start at 3999 and keep adding 9 to find the minimum. Then start at 99930 and keep subtracting 9 to find the maximum. You would need to check if the number is in the array and then if the number’s digits sum to 30.
Note that the conversion of numbers to base 10 to check the digits typically involves doing division and modulus operations, so you are already doing those even if you remove the modulus operation from this check. That is unless you find a clever way of extracting the digits using the modular multiplicative inverse to calculate x/10^k.
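A rough sketch of that boolean-array scan (my illustration, assuming the post's setup; the digit-sum check uses str/int here for brevity):

    import numpy as np

    nums = np.random.randint(1, 100_001, size=1_000_000)
    seen = np.zeros(100_001, dtype=bool)
    seen[nums] = True  # mark which values appeared in the input

    def scan(start: int, stop: int, step: int):
        # Stepping by 9 keeps every candidate congruent to 30 mod 9, but we
        # still confirm presence in the input and the exact digit sum.
        for n in range(start, stop, step):
            if seen[n] and sum(map(int, str(n))) == 30:
                return n
        return None

    lo = scan(3999, 100_001, 9)   # smallest digit-sum-30 value, scanning upward
    hi = scan(99930, 0, -9)       # largest in-range digit-sum-30 value, downward
    result = hi - lo if lo is not None and hi is not None else None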
3 replies →
Doing a single modulo-9 operation is much faster than summing the digits of a d-digit number, which requires d modulo-10s, d divisions by 10, and d additions.
2 replies →
Each sum involves determining the digits to sum, which involves using % multiple times.
Also, you don't have to use % in order to decide whether to perform the sum-of-digits check for a given value. You can just iterate over values to check in steps of 9.
> Test if the number is < min or > max _before_ doing the digit sum. It's a free 5.5x speedup that renders some of the other optimizations, like trying to memoize digit sums, unnecessary.
How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)
On the other hand, we only need to consider numbers congruent to 3 modulo 9.
And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.
And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.
And if they aren't, we could just step through the candidates 9 at a time until we find values that are in the input and have the required digit sum (stepping by 9 only preserves the digit sum mod 9, so it could differ from 30 by a multiple of 9) - building a `set` from the input values for the membership check.
I actually think precomputing the numbers with digit sum 30 is the best approach. I'd give a very rough estimate of 500-3000 candidates because 30 is rather high, and we only need to loop for the first 4 digits because the fifth can be calculated. After that, it is O(1) set/dict lookups for each of the 1000000 numbers.
Everything can also be wrapped in list comprehensions for top performance.
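A minimal sketch of that precompute-and-look-up approach (my own code, assuming the post's setup; looping over the first four digits and deriving the fifth yields a few thousand candidates):

    import random

    # Loop over the first four digits; the fifth is forced by the target sum of 30.
    candidates = {
        d1 * 10_000 + d2 * 1_000 + d3 * 100 + d4 * 10 + d5
        for d1 in range(10) for d2 in range(10)
        for d3 in range(10) for d4 in range(10)
        for d5 in [30 - d1 - d2 - d3 - d4]
        if 0 <= d5 <= 9
    }

    nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
    hits = [n for n in nums if n in candidates]  # O(1) set lookup per number
    result = max(hits) - min(hits) if hits else None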
It's decent when you prompt it to find easy-to-miss but substantial improvements around corner cases, which is something I've taken to doing.
Basically you just have to put it in the mode that's looking for such things.
(Small correction, multiply my times by 10, sigh, I need an LLM to double check that I'm converting seconds to milliseconds right. Base 550ms, optimized 70ms)
Or the other obvious optimization: hard-code the lookup table in the code as a huge list, instead of creating it at runtime.
That's a prompting issue though.
Do you have an example prompt that works?
I had a scan of the code examples, but one other idea that occurred to me is that you could immediately drop any numbers below 999 (probably slightly higher, but that would need calculation rather than being intuitive).
> probably slightly higher, but that would need calculation rather than being intuitive
I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)
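And a quick brute-force confirmation of that reasoning (illustrative only):

    assert min(n for n in range(1, 10_000) if sum(map(int, str(n))) == 30) == 3999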