Comment by bananapub
2 days ago
> On the other, we have gigantic, slow, expensive, IQ-maxxing reasoning models that we go to for deep analysis (they’re great at criticism), one-shotting complex problems, and pushing the edge of pure intelligence.
I quite enjoy having an LLM write much of my tedious code these days, but comments like this are just bizarre to me. Can someone share a text question that I can ask an expensive slow LLM that will demonstrate “deep analysis” or “iq-maxxing” on any topic? Whenever I ask them factual or discussion questions I usually get something riddled with factual errors or just tedious, like reading an essay someone wrote for school.
I use o3 for my PhD math research. When I am facing a specific problem and I am out of ideas, I pass it to o3. It will usually say something with a fair number of errors and eventually claim to have solved my problem in a standard manner, which it almost never does. But that does not mean it is not useful to me. My attention is like a flashlight illuminating a tiny spot in the possibly vast field of methods I could try. Right now my head is full of dispersive PDEs, so I will not think of using parabolic regularization. But o3 is more of a dim background light. In the end I am better at using any particular technique that is familiar to me than o3 is, but in this very moment I can only think of a few options. Sometimes my specific problem is actually naturally tackled by a method I have not considered, and o3 suggests it. Whether or not you consider that iq-maxxing, in that moment it is for me, because it helps.
You should also try o4-mini-high. Or, if you have already, I’m curious to hear how they compare for you. I somewhat suspect that o4-mini is better on pure math problems that take more thinking and less world knowledge.
Yeah, I try them both, but I honestly cannot tell much of a difference. Subtle things.
I ran into a weird joystick bug the other week, and I wanted ChatGPT to figure out the exact code flow of how a specific parameter is set.
I had it analyze several related libraries; it zeroed in on the SDL and Wine codebases and found the exact lines of code behind the logic error in Winebus.
It really helps me dig deep into certain hard-to-track bugs.
I really like using o3 to help with thorny architecture problems by researching existing related solutions on the internet, distilling them, and comparing trade-offs with me
The one I asked o3-pro yesterday was "Research the annual smoking tobacco production in Soviet Union 1939-1958 and plot it in a graph versus male population size"
And how was the result? Did you verify that it found a reliable source of data?
This is the kind of thing I absolutely don't trust it for. It generates a very convincing-sounding report, but for a lot of tasks I've found the numbers don't reasonably match up with my own.
Validating the info it gives in response to a question like this sounds like it would be extremely tedious, unless you already had a hand-curated data set to answer it.
Did you? Did the data match?
I don't have any good idea of what "good" prompts for demonstrating such models are. But what I would ask such a model is the following. I have no idea whether it would fall on its face or not.
Can you write a version of Chorin's projection method for the Navier-Stokes equations that is both explicit and second order in time?
Ideally the model should not need a more detailed prompt than this. A first-year grad student in numerical analysis certainly would not.
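For readers who don't know the method, here is a rough sketch of what a fully explicit, second-order-in-time projection step could look like in the friendliest possible setting: a 2D periodic box, where the spectral Leray projection is exact, using Adams-Bashforth 2 for the explicit terms. The grid size, viscosity, and initial condition below are just illustrative choices, and it skips dealiasing entirely; it is not meant as the answer I'd expect from the model.

```python
# Hedged sketch, not a model answer: a fully explicit, second-order-in-time
# projection step for 2D incompressible Navier-Stokes on a periodic box.
# With periodic BCs the spectral Leray projection is exact, so the usual
# first-order pressure-splitting error of classical Chorin never shows up;
# all parameters and the initial condition are illustrative, and there is
# no dealiasing, for brevity.
import numpy as np

N, L, nu, dt = 64, 2 * np.pi, 1e-2, 1e-3

k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)            # angular wavenumbers
kx, ky = np.meshgrid(k, k, indexing="ij")
k2 = kx**2 + ky**2
k2_safe = np.where(k2 == 0, 1.0, k2)                   # avoid division by zero at k = 0

def project(uh, vh):
    """Leray projection in Fourier space: subtract the part of u-hat along k."""
    kdotu = kx * uh + ky * vh
    return uh - kx * kdotu / k2_safe, vh - ky * kdotu / k2_safe

def rhs(uh, vh):
    """Explicit RHS -(u . grad)u + nu * Laplacian(u), returned in Fourier space."""
    u, v = np.fft.ifft2(uh).real, np.fft.ifft2(vh).real
    ux, uy = np.fft.ifft2(1j * kx * uh).real, np.fft.ifft2(1j * ky * uh).real
    vx, vy = np.fft.ifft2(1j * kx * vh).real, np.fft.ifft2(1j * ky * vh).real
    Nu, Nv = np.fft.fft2(u * ux + v * uy), np.fft.fft2(u * vx + v * vy)
    return -Nu - nu * k2 * uh, -Nv - nu * k2 * vh      # Laplacian is -k^2 in Fourier space

# Taylor-Green-style, divergence-free initial condition
x = np.linspace(0, L, N, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
uh = np.fft.fft2(np.cos(X) * np.sin(Y))
vh = np.fft.fft2(-np.sin(X) * np.cos(Y))

# Bootstrap AB2 with one projected forward-Euler step
Fu_old, Fv_old = rhs(uh, vh)
uh, vh = project(uh + dt * Fu_old, vh + dt * Fv_old)

for _ in range(1000):
    Fu, Fv = rhs(uh, vh)
    # Adams-Bashforth 2: fully explicit, formally second order in time
    uh = uh + dt * (1.5 * Fu - 0.5 * Fu_old)
    vh = vh + dt * (1.5 * Fv - 0.5 * Fv_old)
    uh, vh = project(uh, vh)                            # enforce incompressibility
    Fu_old, Fv_old = Fu, Fv
```

The hard part of the question, which this deliberately sidesteps, is getting genuine second-order accuracy with real boundary conditions, where the classical first-order splitting runs into trouble; whether a model can bridge that gap is what I would actually want to see.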
Try pasting in an HN thread where people are disagreeing with each other vehemently and asking it for a critique or a breakdown.
An example from Sonnet 4 'thinking':
Thread
* https://imgur.com/aFl9uiA
This is just a trivial way to illustrate some capability; it is not meant to be deep or insightful, or an end-task in itself.
This is good enough for me, even if it's not solving your problem. It gives you options and fills some of the information void.