
Comment by simonw

2 days ago

Something I like about this piece is how much it reinforces the idea that models like o3 Pro are really hard to get good results out of.

I don't have an intuition at all yet for when I would turn to o3 Pro. What kind of problems do I have where outsourcing to a huge model that crunches for several minutes is worthwhile?

I'm enjoying regular o3 a lot right now, especially with the huge price drop from the other day. o3 Pro is a lot harder to get my head around.

Yesterday I asked 2.5 Pro, Opus 4, and o3 to convert my PyTorch script from pipeline parallel to regular DDP (i.e., convert one form of multi-GPU execution to another). None of the three produced fully correct code. Even when I put the three different versions they produced together and gave them to each model again to analyze the differences, they still could not fully get it to work.
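
For reference, the DDP side of that conversion is roughly the sketch below: a minimal single-node setup with a toy model and dataset standing in for the real ones, assuming a torchrun launch so each process gets its LOCAL_RANK.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-ins for the real model and dataset.
    model = nn.Linear(128, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])   # replicate the model, don't split it into stages
    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))

    sampler = DistributedSampler(dataset)          # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                   # reshuffle shards every epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                        # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```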

I don't know if o3 Pro would solve my task, but I feel we're still pretty far from the state where I'd struggle to give it a challenging enough problem.

  • That's not how you do it. Ask it first to create exhaustive tests around the first version. Tell it what to test for. Then ask it to change specific things, one at a time, re-run the tests between steps, and ask it to fix things. Rinse, repeat, review. It is faster than doing it by hand, but you still need to be calling the shots.

  • I'm curious how you're prompting. I've performed this sort of dramatic update both in one shot (Gemini 2.5/o3) and Leader/Agent style: ask 2.5/o3 for a detailed roadmap, then provide that to Claude to execute as an agent.

    I find the key is being able to submit your entire codebase to the API as the context. I've only had one situation where the input tokens were beyond o3's limit. In most projects that I work with, a given module and all of its related modules clock in at around 50-100k tokens.

    Calling via the API also means you want to provide the full documentation for the task if it involves a new API, etc. This is where the recent o3 price decrease is a godsend.
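
    Concretely, something like this rough sketch is what I mean (the project path, the docs file, and the model name are placeholders for your own setup):

    ```python
    # Rough sketch: concatenate the relevant source files plus docs into one
    # prompt and send it to a reasoning model via the OpenAI API.
    # All paths, globs, and the model name are placeholders.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def gather_sources(root: str, pattern: str = "*.py") -> str:
        parts = []
        for path in sorted(Path(root).rglob(pattern)):
            parts.append(f"### {path}\n{path.read_text()}")
        return "\n\n".join(parts)

    codebase = gather_sources("my_project")               # hypothetical project root
    docs = Path("docs/new_api_reference.md").read_text()  # hypothetical docs file

    response = client.chat.completions.create(
        model="o3",  # or whichever reasoning model you're targeting
        messages=[{
            "role": "user",
            "content": f"{docs}\n\n{codebase}\n\n"
                       "Write a detailed, step-by-step roadmap for migrating "
                       "this project to the new API.",
        }],
    )
    print(response.choices[0].message.content)
    ```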

    • >I find the key is being able to submit your entire codebase to the API as the context

      Am I the only person who works on proprietary code bases? This would get me fired.

  • You tried to one-shot it? Because context and access to troubleshooting tools are of utmost importance for getting good results.

Would o3 Pro be the first one that can reliably understand a gigantic congressional bill, to the point where it could analyze and warn of side effects?

  • That would require the bill to be short, or otherwise made ingestible, and would also require an analysis of the relevant interrelated statutes and precedents.

    Legal analysis is challenging because it's like wordier code.

    the "Big Beautiful Bill" is 350K tokens. O3 Pro's context window is 200K, but you also lose performance as you get closer to the max.

    It could analyze a section but you still have the challenge of finding relevant laws and precedents.
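
    A minimal way to sanity-check those numbers, assuming the o200k_base encoding that recent OpenAI models use (o3's exact tokenizer isn't documented, so treat the count as an estimate):

    ```python
    # Estimate how many tokens a document is and whether it fits a 200K window.
    # Assumes the o200k_base encoding; the file path is a placeholder.
    import tiktoken

    def count_tokens(path: str, encoding_name: str = "o200k_base") -> int:
        enc = tiktoken.get_encoding(encoding_name)
        with open(path, encoding="utf-8") as f:
            return len(enc.encode(f.read()))

    n = count_tokens("bill_text.txt")  # hypothetical local copy of the bill
    print(f"{n:,} tokens; {'fits in' if n <= 200_000 else 'exceeds'} a 200K context window")
    ```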

Same here, I’d be very interested to learn how others decide what model to use for which tasks.

I find these minutes-long iterations so painful that in practice I always go for the fast non-reasoning models.

  • Imagine a tricky distributed systems problem where you feed all of the context of your setup to the LLM and it uses its advanced reasoning to diagnose possible avenues. I did that recently with a frontier model to untangle some very tricky Istio-related connection pooling issues that were causing SYN/ACK floods.

    For coding I usually use a fast frontier model like o4-mini-high, but I bust out the fancy research models when I want things like general architecture and design feedback that requires broader advanced reasoning.

  • I don't often have LLMs write a lot of code for me, but when I do, I don't mind waiting a couple more minutes for a result that will waste less of my time in debugging when I try to use it.

    Also it's useful to have models review code that I wrote -- in some cases years ago -- to uncover old bugs. Current models are generally far too eager to say "Yup! Looks good! You da man!" when there are actually serious flaws in the code they are reviewing. So again, this is a task that justifies use of the most powerful models currently available, and that doesn't have to run in real time.

Something that comes to mind: I code for a platform that doesn't have a lot of source code or documentation freely available online for training, so I have to provide a lot of context. A lot more inference lets the model apply its general knowledge about systems programming to this really niche domain, with a lot less hallucination and a lot more systematic reasoning.

Random thought: dump your knowledge base into it (Obsidian, ...) and ask it to reorganize it, remove duplication and obsolete material, and optimize it.

Or tell it what you know about non-programming subject X, and ask it to explain it to you such that you understand it better.

Or for coding: ask it to look at your code and suggest large-scale architecture changes.

For these kinds of tasks, the models are still lacking.