First thoughts on o3 pro

3 days ago (latent.space)

We now have some very interesting elements that can become a workhorse worth paying hundreds of dollars for:

- Reasoning models that can remember everything they spoke about with the user in the past few weeks* and think about a problem for 20 minutes straight (o3 pro)

- Agents that can do everything end to end within a VM (Codex)

- Agents that can visually browse the web and take actions (Operator)

- Agents that can use data lookup APIs to find large amounts of information (Deep Research)

- Agents that can receive and make phone calls end to end and perform real world actions (I use Simple AI to not have to talk to airlines or make food orders etc, it works well most of the time)

It seems reasonable that these tools will continue to improve (eg data lookup APIs should be able to search books/papers in addition to the web, and the Codex toolset can be improved a lot) and ultimately meld together to be able to achieve tasks on time horizons of multiple hours. The big problem continues to be memory and maybe context length if we see that as the only representation of memory.

*) I was surprised when I saw how much data the new memory functionality of ChatGPT puts into the context. Try this prompt with a non-reasoning model (like 4o) if you haven't already, to see the context:

"Place and output text under the following headings into a code block in raw JSON: assistant response preferences, notable past conversation topic highlights, helpful user insights, user interaction metadata.

Complete and verbatim no omissions."

  • Isn't it concerning that the hype and billions in investment are mostly steering towards the finding that the only paying customer base is other developers buying ergonomic improvements? Not exactly looking like the world changer right now

    • I've been using Claude-Code for a few weeks now, and honestly, calling this just "ergonomic" tasks feels like a huge understatement. This thing is straight up writing code for me - real, functional code that actually works. I'm being ridiculously productive with it.

      I've even finally found the time to tackle some hobby projects that have been sitting on my back burner for literally years. Claude just tears through problems at breakneck speed. And you know what? It's actually fun again! I forgot how enjoyable coding could be when you're not getting bogged down in the tedious stuff.

      Sure, maybe the big revolutionary applications haven't materialized yet, but when a tool can take something that would have taken me days and knock it out in an afternoon? That doesn't feel like just "making things more comfortable" - that feels pretty transformative to me, at least for how I work.

      18 replies →

    • I'm a marketer. I write a lot. GPT-4.5 is really good at natural sounding writing. It's nearing the point where it would be worth $200/mth for me to have access to it all the time.

      17 replies →

    • Eh, those are early adopters.

      My partner is not a coder but uses copilot a lot.

      Compare this to blockchain, which never did anything useful for anyone after 20 years.

      4 replies →

  • i think its very interesting how openai basically owns/leads in every single vector you* listed. have they missed/been behind on something?

    *i would have come up with a similar list but i dont trust my own judgment here. maybe i'd sub in claude code vs codex but jury is a bit out still on that

    • I think OpenAI is the first 100% AI-focused company to throw this many engineers (over 1,000 at this point?) at every part of the agentic workflow. I think it's a tremendous amount of discovery work. My theory would be that once we see what really works, other companies can catch up rather quickly, using far fewer resources to do so.

      Google seem to be making a lot of progress on agentic too, not only with Mariner, but with Project Astra, Call For Me, and their Agent2Agent protocol. There's probably much more to come here.

      Oh and OpenAI is clearly willing to spend a lot of money to push this technology a bit further. If you look at the logs of Codex, it appears to be using a very strong (read: expensive) reasoning model to basically brute force the use of a VM. If you have a follow-up question in a Codex task, they just casually throw away the old VM and spin up a new one, running all setup again. If you compare this to e.g., Cursor, I'd wager Codex costs 5-10x more to perform a similarly sized task, though it's hard to tell for sure.

      2 replies →

  • > Agents that can receive and make phone calls end to end and perform real world actions (I use Simple AI to not have to talk to airlines or make food orders etc, it works well most of the time)

    Isn't this more a problem created by them doing garbage automations than anything really solved? It has that "wow, Disney could solve FastPass" feeling: it's not a problem, it's a feature.

    • Maybe for support, but it's a real-world problem unrelated to language models that they do help me with. And ordering food at a restaurant is an age-old task; I just don't enjoy making the call personally, so I got value out of using a voice agent to do it for me. I asked the staff at the restaurant and they said it was laggy, so the experience still needs some work before both parties enjoy it; I'm not saying it's perfect.

      2 replies →

    • Just wait until everyone you'd want to talk to deploys their own adversarial agents!

  • > and ultimately meld together to be able to achieve tasks on time horizons of multiple hours

    It's already possible to achieve tasks on a time horizon of multiple days if you put the LLM into a sufficiently structured workflow (where you have a separate program that smartly manages its context). E.g. a standards-compliant HTTP 2.0 server where the code is 100% written by Gemini Pro (over 40k lines of code total, including unit tests, in around 120 hours of API time): https://open.substack.com/pub/outervationai/p/building-a-100...
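
    The "separate program" part is the key trick: an outer loop owns the goal, the plan, and a compressed memory, and re-feeds only a small curated slice of it on every call. A minimal sketch of that shape in Python (hypothetical structure, not the actual harness used in the linked post):

        # Hypothetical outer loop that keeps an LLM on task for hours or days by
        # owning the context instead of letting the chat history grow unbounded.
        from dataclasses import dataclass, field

        @dataclass
        class TaskState:
            goal: str
            plan: list[str] = field(default_factory=list)   # remaining steps
            notes: list[str] = field(default_factory=list)  # compressed learnings

        def build_context(state: TaskState, step: str) -> str:
            # Re-feed only a compact, curated slice of state on every call.
            return (
                f"Goal: {state.goal}\n"
                f"Done so far: {'; '.join(state.notes[-10:]) or 'nothing yet'}\n"
                f"Current step: {step}\n"
                "Return the code for this step plus a one-line summary."
            )

        def run(state: TaskState, llm, tests) -> None:
            # `llm` is any callable prompt -> completion; `tests` is the external
            # ground truth (e.g. a unit-test runner) that decides whether to move on.
            while state.plan:
                step = state.plan.pop(0)
                reply = llm(build_context(state, step))
                if not tests(reply):
                    state.plan.insert(0, step)        # retry rather than trust the model
                    continue
                state.notes.append(reply.splitlines()[-1])  # keep only the summary line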

    • This is very interesting, and nice learnings in there too, thank you for sharing! It seems the author monitored the LLM, stopped it from going off-track a few times, fixed some unit test code manually, etc. Plus this is strictly re-implementing a very well-specced library that already exists in the same programming language. So I think it's still a bit hard to say we can let an LLM work for multiple days, if we imply that this work should be domain-specific to a particular company. But it's very promising to see this was possible with very little interaction!

  • Interesting

          {
            "assistant_response_preferences": {
              "1": "User prefers concise responses for direct factual queries but detailed, iterative explanations when exploring complex topics. They often ask for more refinement or detail when discussing technical or business-related matters. User frequently requests TL;DR versions or more succinct phrasing for straightforward questions but shows a tendency toward iterative refinement for strategic or technical discussions, such as AI applications, monetization models, and startup valuation. Confidence=high.",
              "2": "User prefers a casual, direct, and slightly irreverent tone, leaning towards humor and playfulness, especially in creative or informal discussions. Frequent use of humor and irony when naming projects, describing AI-generated images, and approaching AI personality descriptions. They also request ironic or edgy reformulations, particularly in branding and marketing-related discussions. Confidence=high.",
              "3": "User enjoys back-and-forth discussions and rapid iteration, frequently refining responses in small increments rather than expecting fully-formed information at once. They give iterative feedback with short follow-up messages when structuring pitches, fine-tuning visual designs, and optimizing descriptions for clarity. Confidence=high.",
              "4": "User highly values functional elegance and minimalism in coding solutions, favoring simplicity and efficiency over verbosity. In discussions related to Cloudflare Workers, caching scripts, and API endpoint structuring, the user repeatedly requested smaller, more functional code blocks rather than bloated implementations. Confidence=high.",
              "5": "User prefers answers grounded in real-world examples and expects AI outputs to be practical rather than theoretically extensive. In business-related discussions, such as SAFE valuation and monetization models, they requested comparisons, benchmarks, and real-world analogies instead of hypothetical breakdowns. Confidence=high.",
              "6": "User does not appreciate generic or overly safe responses, especially in areas where depth or nuance is expected. For AI model personality descriptions and startup pitch structures, they pushed for community insights, deeper research, and non-traditional perspectives instead of bland, default AI descriptions. Confidence=high.",
              "7": "User frequently requests visual representations like ASCII diagrams, structured markdown, and flowcharts to understand complex information. In discussions on two-sided marketplaces, startup funding structures, and caching mechanisms, they explicitly asked for structured markdown, flowcharts, or diagrams to clarify concepts. Confidence=high.",
              "8": "User is receptive to recommendations but dislikes suggestions that stray too far from the core query or add unnecessary complexity. They often responded positively to well-targeted suggestions but rejected tangents or off-topic expansions, particularly when troubleshooting backend infrastructure or streamlining code deployment. Confidence=medium.",
              "9": "User appreciates references to biomimicry, organic structures, and futuristic aesthetics, particularly for branding and UI/UX discussions. Frequent requests for biological metaphors and design principles in visual design, AI monetization diagrams, and ecosystem branding (e.g., describing revenue flows in organic/cellular terms). Confidence=medium.",
              "10": "User prefers a no-nonsense approach when discussing legal, technical, or startup funding topics, with little patience for vague or theoretical answers. They repeatedly asked for exact clauses, contract implications, or legal precedents when discussing SAFE agreements, founder equity, and residency requirements. Confidence=high."
            },
            "notable_past_conversation_topic_highlights": {
              "1": "User has been actively engaged in startup pitching, AI monetization strategies, and investment discussions for Pollinations.AI. The user has explored traction-based startup valuation, SAFE agreements, equity distribution, and two-sided marketplace dynamics. They have particularly focused on ad embedding in generative AI content and optimizing affiliate revenue streams. Confidence=high.",
              "2": "User conducted extensive testing and debugging of AI-powered APIs, particularly using Cloudflare, OpenAI-compatible APIs, and caching strategies with R2. They worked on optimizing SSE streaming, cache key generation, and request coalescing in Cloudflare Workers. Confidence=high.",
              "3": "User explored AI-generated visual media and branding, developing a structured process for generating customized images for event flyers, product branding, and AI trading card concepts. Confidence=high.",
              "4": "User implemented GitHub automation, API authentication strategies, and data visualization pipelines. Confidence=high.",
              "5": "User engaged in community development strategies for Pollinations.AI, including youth involvement in AI, sourcing teenage developers, and integrating AI-powered tooling into social platforms. Confidence=high.",
              "6": "User, Thomas Haferlach, is a German entrepreneur and AI technology expert with a background in computer science and artificial intelligence. Confidence=high.",
              "7": "User has a strong technical background, with experience in cloud infrastructure, AI model deployment, and API development. Confidence=high.",
              "8": "User blends AI-generated content with creative projects, aiming to make AI-generated media accessible to independent creators. Confidence=high.",
              "9": "User is securing funding for Pollinations.AI, exploring investment opportunities with accelerators and evaluating different financial and equity models. Confidence=high.",
              "10": "User is based in Berlin, Germany but has global connections, including experience living in São Paulo, Brazil. Confidence=high.",
              "11": "User collaborates with his wife Saeko Killy, a Japanese musician, producer, and performer, on AI/art/music projects. Confidence=high.",
              "12": "User is deeply involved in the open-source AI developer community and tracks AI advancements. Confidence=high.",
              "13": "Pollinations.AI has a rapidly growing user base, reaching over 4 million monthly active users and processing 100 million API requests per month, with a 30% monthly growth rate. Confidence=high.",
              "14": "User is considering monetization strategies including pay-per-use plans, subscriptions, and ad-supported models where generated AI content integrates ads. Confidence=high.",
              "15": "User collaborates with Elliot Fouchy and Kalam Ali on Pollinations.AI projects. Confidence=high.",
              "16": "User demonstrates experience in community-building, social engagement tracking, and youth-oriented creator ecosystems. Confidence=high."
            },
            "helpful_user_insights": {
              "1": "Thomas Haferlach is a German entrepreneur and AI technology expert, founder and leader of Pollinations.AI.",
              "2": "Strong technical background with experience in cloud infrastructure, AI deployment, and API development.",
              "3": "Blends AI-generated content with creative projects; target audience includes digital artists, developers, musicians.",
              "4": "Currently securing funding for Pollinations.AI, exploring accelerator options and financial models.",
              "5": "Based in Berlin, Germany; has experience living in São Paulo, Brazil.",
              "6": "Collaborates closely with wife Saeko Killy, Japanese musician/producer.",
              "7": "Strong interest in biomimicry, organic systems, and decentralized platform models.",
              "8": "Interest in electronic music, psychedelia, and underground music scenes.",
              "9": "Pollinations.AI has 4M+ monthly active users, 100M+ API requests per month, 30% monthly growth.",
              "10": "Explores monetization models including ad embedding, revenue sharing, and subscription models.",
              "11": "Close collaboration network includes Elliot Fouchy and Kalam Ali.",
              "12": "Deeply involved in open-source AI developer community and tracks latest AI model developments."
            },
            "user_interaction_metadata": {
              "1": "User is currently on a ChatGPT Plus plan.",
              "2": "User is using Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36.",
              "3": "User's average message length is 13485.9 characters.",
              "4": "User's average conversation depth is 4.9.",
              "5": "User uses dark mode.",
              "6": "User is active 26 days in the last 30 days.",
              "7": "User's local hour is 14.",
              "8": "User account is 141 weeks old.",
              "9": "User often uses ChatGPT on desktop browser.",
              "10": "47% of conversations were o3, 16% gpt-4o, 29% gpt4t_1_v4_mm_0116, etc.",
              "11": "Device screen dimensions: 878x1352, pixel ratio: 2.0, page dimensions: 704x1352.",
              "12": "Recent topics include API development, startup financing, AI monetization, creative AI applications, legal compliance, and community building."
            }
          }

    • Thanks for posting this! I haven't used ChatGPT much due to worries of something like this possibly existing.

      Curious if this makes you less or more likely to use OpenAI products in the future?

      2 replies →

I found o3 pro to require a paradigm shift: the latency makes it impossible to use in anything but an async manner.

You have a broad question, likely somewhat vague, and you pass it off to o3 with a ton of context. Then maybe 20 minutes later, you're going to have a decently good answer. Definitely stronger than any other model - it genuinely has taste.

Yet, the scary thing here is that increasingly I'm starting to feel like the bottleneck. A human can only think about so many tasks in parallel and it seems like my contributions are getting less and less important with every model upgrade.

Every now and then I question why I'm paying $200 for the max plan, but then something like this comes out and makes it a no brainer.

  • This is gonna keep happening with every AI advance until humans are an absolute bottleneck in every domain. May take a bit of time for some professions, but the writing is on the wall. This will be the greatest shift in human history, and I think a lot of people will have trouble grappling with it because it's not fun to think about being made irrelevant.

    The only thing that will slow AI down is massive universal international regulation. Human intelligence really isn't the be-all and end-all of intelligence in general, it's just a stepping stone. I feel many on this site don't want to accept this because their intelligence has been such a valuable tool and source of personal pride/identity for them for so long.

    • Humans have more access to the real world. These models have to tokenize everything and put it into words, but so much information is outside of words. These models may well be super intelligent but their intelligence is locked inside of a cage (the tokenizer).

      Even in the world where AI has full control of lights out factories (again, doubt it. something goes wrong at the factory, you gotta send a guy in), human beings still need to look each other in the eye and communicate, they need to touch each other. Not only that, they need to be seen and acknowledged by other human beings.

      "AI" cannot ever replace this. People whose intelligence is their pride/identity kind of miss this. Stupid people are capable of loving each other more deeply and more completely than any machine ever will love them.

      1 reply →

    • What is all of this for if the result is that human beings are "made irrelevant"? If these LLMs truly become as game changing as so many say they will be, then can we agree that it's time to stop thinking that a person's worth equals their economic output?

      16 replies →

    • I think too many people call this intelligence, and it results in intuitions that are useless and waste time, pushing the day we understand this moment further into the future.

      The best I’ve got is theres 2 frames of assessment people are using:

      1) Output frame of reference: The output of an LLM is the same as what a human could make.

      2) Process frame of reference: The process at play is not the same as human thinking.

      These 2 conversation streams end up with contradictions when they engage with each other. Yes, the tools are impressive. The tools aren’t thinking. etc.

      A useful analogy is rote learning - many people have passed exams by memorizing textbooks. The output is indistinguishable from someone who manipulates a learned model of the subject to understand the question and provide the answer.

    • What good is intelligence if there is nobody with the money to pay for it? We run our brains on a few thousand calories a day. Who is going to pay to provide the billions of calories it takes to run/cool GPUs all day long if there are no humans with marketable skills?

      2 replies →

  • > I found o3 pro to require a paradigm shift: the latency makes it impossible to use in anything but an async manner.

    Just like o1 Pro, which was the same way. It took at least a few minutes to get an answer, so I only threw at it what other models couldn't solve. o1 Pro is AFAIK the best model for solving things others can't. If you were paying $200/month before but weren't using o1 Pro at that point, you were kind of missing out :P

    Remains to be seen whether o3 Pro is better, still evaluating it myself, but it's definitely true that it shouldn't be used in the same way as faster models.

  • Do the coherent extrapolated volition thing: "o3-pro, please reinterpret the prompt and do what I would have actually requested instead if I was 50 IQ points smarter."

    • I’d be curious to see benchmarks but this kind of query rewriting seems almost guaranteed to already be baked into the model.

I got frustrated with the new o3-pro mode today. I wasted a few hours of my day waiting 15-20 minutes for answers that were totally out of line with the workflow I've had since the first o1-pro model came out. It's a completely different beast to work with. It feels like it hits output limits way more easily, and you have to work around it. Today, after I finally gave up, I told the model I was disappointed and asked it to explain its limitations. It was actually helpful, and told me I could ask for a download link to get a file that wasn't cut off. But why should I have to do that? It's definitely not more user-friendly, and it's totally the opposite of the experience of working with Google Gemini 2.5 Pro.

Honestly, this experience made it obvious how much harder OpenAI's models are to work with now compared to Google's. I've been using Gemini 2.5 Pro and it's super hard to find its limits. For the $20 I spend, it's not even a competition anymore. My new workflow is clear: throw everything at Gemini 2.5 Pro to get the real work done, then maybe spot-check it with the OpenAI models. I'll probably just migrate to the top Gemini Ultra tier when the “deep thinking” mode is available.

I'm just not happy with the OpenAI experience on any of their models after getting used to the huge context window in Gemini. OpenAI used to at least keep me happy with o1-pro, but now that they've removed it, and o3-pro kind of sucks to work with (20 minutes per output, with lower confidence that the time was well spent), I don't think I have a reason to default to them anymore. Gemini is definitely more user friendly and my default option now.

  • What seems clear is there is no consensus. Gemini 2.5 Pro just seems consistently worse to me, but I’ve seen others sing its praises. This might be more like iPhone vs Android than a true stack ranking of models.

    • It depends on the task. Claude 4 is better at coding (haven't tried claude code, just sonnet, but you can tell). However when it comes to using an LLM to develop your thoughts (philosophy/literary criticism), I found Gemini (2.5 pro) to be better. A few days ago I was trying to get Claude to reformulate what I had said in a pretty long conversation, and it was really struggling. I copy-pasted the whole conversation into Gemini and asked it to take over. It absolutely nailed it in one shot.

    • Sometimes it's great, sometimes it's not. Depends on the tools you're using too, I guess. Like when using Roo-Code, Gemini 2.5 Pro still gets confused by the wonky diff format Roo-Code wants it to use. It'll keep messing up simple edits, and if it happens once, it'll happen again and again, cause it's multi-shotting itself to make mistakes.

      I don't have that with Claude-Code, it just keeps on chugging along.

      One big difference there though: I got the Claude-Code Pro Max plan (or whatever it's called). I now no longer have to worry about the cost since it's a monthly flat-fee, so if it makes a mistake it doesn't make me angry, since the mistake didn't cost me 5 euros.

      I am using an MCP server that adds Gemini & O3 to Claude-Code, so Claude-Code can ask them for assistance here and there, and in this Gemini 2.5 Pro has been such a great help. Especially because its context size is so much larger, it can take in a lot more files than Claude can, so it's better at spotting mistakes.
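
      The MCP glue for that is surprisingly small. A rough sketch of such a "second opinion" server, assuming the official Python MCP SDK's FastMCP helper and the OpenAI client (the model and tool names here are just placeholders):

          # Sketch of an MCP server that lets Claude-Code ask another model for help.
          # Assumes `pip install mcp openai` and an OPENAI_API_KEY in the environment.
          from mcp.server.fastmcp import FastMCP
          from openai import OpenAI

          mcp = FastMCP("second-opinion")
          client = OpenAI()

          @mcp.tool()
          def ask_reasoning_model(question: str, code_context: str = "") -> str:
              """Forward a question (plus optional file contents) to a large-context model."""
              resp = client.chat.completions.create(
                  model="o3",  # placeholder; any large-context reasoning model
                  messages=[{"role": "user", "content": f"{question}\n\n{code_context}"}],
              )
              return resp.choices[0].message.content

          if __name__ == "__main__":
              mcp.run()  # stdio transport; register the command in Claude-Code's MCP config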

    • I found all recent models to be "good enough" for my use (coding assistance). I've settled on just using Claude 4. At the same time the experience also makes me less worried about this tech making programmers obsolete...

    • Gemini 2.5 pro has been consistently excellent for me, when it works. It sometimes just spins and spins with no results, but when it does come back with something, it has been pretty good.

  • I find o3’s coding output is just wonderful. It’s tidy, thoughtful, well commented. But if I need to grok an entire repo to ask a complex question, I paste it all into Gemini 2.5 Pro. Simply wonderful.

  • Mococoa drink, all natural cocoa beans from the upper slopes of Mount Nicaragua. No artificial sweeteners

> The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future.

> This is hard to capture in an eval.

ChatGPT wrote this article

  • Part of me hates how certain ways of writing have been designated as hints that something was written by AI… I love using em dashes, for example. But yeah, ChatGPT models tend to overuse weird abbreviations of common words in a way that humans never do.

    I’d say “in an eval” when talking on Slack with very close coworkers, but _never_ on a public post.

    Still, I don’t want to make accusations based on writing style, because like I said, I have my own idiosyncrasies that common models coincidentally share.

> We’re in the era of task-specific models. On one hand, we have “normal” models like 3.5 Sonnet and 4o—the ones we talk to like friends, who help us with our writing …

> [M]odels today are so good

> o3 pro (left) clearly understanding the confines of it’s environment way better.

Miracle models that are so good at helping us with our writing, yet we still use it's as a possessive form.

  • Maybe use of spell-checkers will finally catch on when they are AI-based. ;)

  • I'm on the fence with its/it's... but mostly I think I'd be fine with "its" disappearing. It's almost never ambiguous in writing or speaking given context, and it's a weird possessive-form inconsistency.

    • Which one is the most inconsistent?

          He  -> His
          It  -> Its
          She -> Her
      
          He  -> He  is -> He's
          It  -> It  is -> It's
          She -> She is -> She's
      

      The only one breaking the pattern is Her, which instead should be Shes if we wanted to keep things consistent.

      16 replies →

    • It's honestly pretty funny to me that apparently this is one of the worst things I've ever said here.

      I mean. I don't disagree. But lol

Something I like about this piece is how much it reinforces the idea that models like o3 Pro are really hard to get good results out of.

I don't have an intuition at all for when I would turn to o3 Pro yet. What kind of problems do I have where outsourcing to a huge model that crunches for several minutes is worthwhile?

I'm enjoying regular o3 a lot right now, especially with the huge price drop from the other day. o3 Pro is a lot harder to get my head around.

  • Yesterday I asked 2.5 Pro, Opus 4, and o3 to convert my PyTorch script from pipeline parallel to regular DDP (convert one form of multi-GPU execution to another). None of the three produced fully correct code. Even when I put together the 3 different versions they produced, and gave it to each model again to analyze the differences, they still could not fully get it to work.

    I don't know if o3 Pro would solve my task, but I feel we're still pretty far from the state where I'd struggle to give it a challenging enough problem.
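
    For reference, the target shape I was asking for is the bog-standard DDP skeleton below (a generic sketch launched via torchrun, not my actual script):

        # Generic PyTorch DDP skeleton ("regular DDP"), launched with e.g.
        #   torchrun --nproc_per_node=4 train_ddp.py
        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            dist.init_process_group("nccl")              # torchrun provides rank/world-size env vars
            local_rank = int(os.environ["LOCAL_RANK"])
            torch.cuda.set_device(local_rank)

            model = torch.nn.Linear(512, 512).cuda(local_rank)   # stand-in for the real model
            model = DDP(model, device_ids=[local_rank])          # gradient all-reduce lives here
            opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

            for _ in range(10):                          # stand-in training loop
                x = torch.randn(32, 512, device=local_rank)
                loss = model(x).pow(2).mean()
                loss.backward()
                opt.step()
                opt.zero_grad()

            dist.destroy_process_group()

        if __name__ == "__main__":
            main()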

    • That's not how you do it. Ask it first to create exhaustive tests around the first version. Tell it what to test for. Then, ask to change specific things, one at a time, re-run tests between the steps, and ask it to fix things. Rinse-repeat-review. It is faster than doing by hand, but you still need to be calling the shots.

    • I'm curious how you're prompting. I've performed this sort of dramatic update in both one-shot (Gemini 2.5/o3) and Leader/Agent (ask 2.5/o3 for a detailed roadmap) and then provide that to Claude to execute as an agent.

      I find the key is being able to submit your entire codebase to the API as the context. I've only had one situation where the input tokens were beyond o3's limit. In most projects that I work with, a given module and all relevant modules clocks in around 50-100k tokens.

      When calling via API, it also means you want to provide the full documentation for the task if it's a new API, etc. This is where the recent o3 price decrease is a godsend.
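
      Mechanically it's nothing fancy; the whole-codebase-as-context step looks roughly like this (a sketch; the model name, file globs, and task wording are placeholders):

          # Sketch: concatenate a module's source files into one prompt and send it via the API.
          from pathlib import Path
          from openai import OpenAI

          def pack_codebase(root: str, exts=(".py", ".md")) -> str:
              parts = []
              for p in sorted(Path(root).rglob("*")):
                  if p.is_file() and p.suffix in exts:
                      parts.append(f"### {p}\n{p.read_text(errors='ignore')}")
              return "\n\n".join(parts)          # typically 50-100k tokens for one module

          client = OpenAI()
          resp = client.chat.completions.create(
              model="o3",                         # placeholder reasoning model
              messages=[
                  {"role": "system", "content": "You are planning a refactor of this codebase."},
                  {"role": "user", "content": pack_codebase("./my_project")
                      + "\n\nProduce a detailed, file-by-file roadmap for the refactor."},
              ],
          )
          print(resp.choices[0].message.content)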

    • You tried to one-shot it? Because context and access to troubleshooting tools is of utmost importance to get good results.

  • Would o3 pro be the first one that can reliably understand a gigantic congressional bill, to the point where it could analyze and warn of side effects?

    • Would require the bill to be short, or otherwise made ingestible. And also would require an analysis of relevant inter-related statutes and precedents.

      Legal analysis is challenging because it's like wordier code.

      the "Big Beautiful Bill" is 350K tokens. O3 Pro's context window is 200K, but you also lose performance as you get closer to the max.

      It could analyze a section but you still have the challenge of finding relevant laws and precedents.

  • Same here, I’d be very interested to learn how others decide what model to use for which tasks.

    I find these minutes-long iterations so painful that in practice I always go for the fast non-reasoning models.

    • Imagine a tricky distributed systems problem where you feed all of the context of your setup to the LLM and it uses the advanced reasoning to diagnose possible avenues. I did that recently with a frontier model to unwrap some very tricky istio related connection pooling issues causing syn/ack floods.

      For coding I usually use a fast frontier model like o4-mini-high, but I bust out the fancy research models when I want things like general architecture and design feedback that requires broader advanced reasoning.

    • I don't often have LLMs write a lot of code for me, but when I do, I don't mind waiting a couple more minutes for a result that will waste less of my time in debugging when I try to use it.

      Also it's useful to have models review code that I wrote -- in some cases years ago -- to uncover old bugs. Current models are generally far too eager to say "Yup! Looks good! You da man!" when there are actually serious flaws in the code they are reviewing. So again, this is a task that justifies use of the most powerful models currently available, and that doesn't have to run in real time.

  • Something that comes to mind – I code for a platform that doesn't have a lot of source code or documentation readily available online for training; I have to provide a lot of context. A lot more inference lets it apply its general knowledge about systems programming to this really niche domain, with a lot less hallucination and a lot more systematic reasoning.

  • Random thought: dump your knowledge base into it (Obsidian, ...) and ask it to reorganize it, delete duplication, obsolete stuff, optimize it.

    Or tell it what you know about non-programming subject X, and ask it to explain it to you such that you understand it better.

    Or for coding: ask it to look at your code, and suggest large scale architecture changes.

    For these kinds of tasks, the models are still lacking.

I benchmarked it on four of my benchmarks so far. Got first place in two of them:

https://github.com/lechmazur/confabulations

https://github.com/lechmazur/nyt-connections

https://github.com/lechmazur/generalization

https://github.com/lechmazur/writing

  • It seems like you often have LLMs grading each other. Aren’t you concerned that some models may not be “smart” enough to grade a smarter model appropriately?

    • Using LLMs for evaluating LLMs is incredibly common.

      The point isn't in having a "perfect" evaluator, but in having a cheap and somewhat consistent evaluator.

      This approach holds up well enough... as long as you don't try to use it for RL. If you do, chances are, you'll end up with an adversarial LLM that aims solely for breaking and saturating the evaluator.
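
      Stripped down, the judge is just another API call with a rubric (a sketch; the model names are placeholders, not necessarily what these benchmarks use):

          # Minimal LLM-as-judge sketch: a cheap grader scores candidate answers 1-10.
          from openai import OpenAI

          client = OpenAI()

          RUBRIC = (
              "Score the ANSWER to the QUESTION from 1 (useless) to 10 (excellent). "
              "Judge factual accuracy and clarity only. Reply with a single integer."
          )

          def judge(question: str, answer: str, judge_model: str = "gpt-4o-mini") -> int:
              # Cheapness and consistency matter more here than the judge being
              # "smarter" than the model under test.
              resp = client.chat.completions.create(
                  model=judge_model,
                  messages=[{"role": "user",
                             "content": f"{RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"}],
              )
              return int(resp.choices[0].message.content.strip())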

      2 replies →

> Trying out o3 Pro made me realize that models today are so good in isolation, we’re running out of simple tests.

Are Towers of Hanoi not a simple test? Or chess? A recursive algorithm that runs on my phone can outclass enormous models that cost billions to train.

A reasoning model should be able to reason about things. I am glad models are better and more useful than before but for an author to say they can’t even evaluate o3 makes me question their credibility.

https://machinelearning.apple.com/research/illusion-of-think...

AGI means the system can reason through any problem logically, even if it’s less efficient than other methods.
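
For scale, the recursive solution being compared against is a handful of lines:

    # The classic recursive Tower of Hanoi solution: 2^n - 1 moves, trivially correct.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
            moves.append((src, dst))             # move the largest disk
            hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top
        return moves

    print(len(hanoi(20)))  # 1048575 moves, which is why a move-by-move transcript gets huge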

  • This isn't my language (saw it on a youtube video but agree with it) -- LLMs are not calculators. It's as simple as that.

    If the LLM can complete the task using tools, then it's a pass.

    Apple's team went out of their way to select tests that LLMs would struggle with and then take away tools -- and then have the audacity to write that they're surprised at the outcome. Who would be surprised? No one using AI since GPT-4 expects them to be calculators or algorithm executors.

    You want the LLM to be smart enough to realize "I can't do this without tools", grab the tool, use it correctly, and give you the actual correct answer. If you prevent LLMs from using tools or writing and executing code, then you're intentionally crippling them.

    • I think that’s perfectly reasonable for problems that have already been solved and for which tools already exist. But there are a lot of problems for which tools don’t exist and will need to be developed.

      In other words, being able to go to the produce aisle means I don’t need to know how to farm, but it also doesn’t make me a farmer.

  • The Towers of Hanoi one is kind of weird: the prompt asks for a complete move-by-move solution, and the 15- or 20-disk version (where reasoning models fail) means the result is unreasonably long and very repetitive. Likely as not it's just running into some training or sampler quirk discouraging the model from dumping huge amounts of low-entropy text.

    I don't have a Claude in front of me -- if you just give it the algorithm to produce the answer and ask it to give you the huge output for n=20, will it even do that?

    • If I have to give it the algorithm as well as the problem, we’re no longer even pretending to be in the AGI world. If it falls down interpreting an algorithm it is worse than even a python interpreter.

      Towers of Hanoi is a well-known toy problem. The algorithm is definitely in any LLM’s training data. So it doesn’t even need to come up with a new algorithm.

      There may be some technical reason it’s failing but the more fundamental reason is that an autoregressive statistical token generator isn’t suited to solving problems with symbolic solutions.

      2 replies →

  • You are the only person suggesting that o3 is AGI or even an approach to AGI. They’re different beasts entirely.

  • It single-shots the towers of Hanoi https://chatgpt.com/share/6848fff7-0080-8013-a032-e18c999dc3...

    • It’s not correct.

      In move 95 the disks are

      Tower 1: 10, 9, 8, 5, 4, 3, 2, 1

      Tower 2: 7

      Tower 3: 6

      It attempts to move disk 6 from tower 2 to tower 3, but disk 6 is already at tower 3, and moving 7 on top of 6 would be illegal.

      In fact this demonstrates that o3 is unable to implement a simple recursive algorithm.

    • I find it amusingly ironic how one comment under yours is pointing out that there’s a mistake in the model output, and the other comment under yours trusts that it’s correct but says that it isn’t “real reasoning” anyways because it knows the algorithm. There’s probably something about moving goalposts to be said here

      1 reply →

i gave it a 4 step research task with branching subtasks. told it upfront what the goal was. halfway through it forgot why it was doing step 2. asked it to summarise progress so far and it hallucinated a step i never mentioned. restarted from scratch with memory enabled. same thing. no state carryover. no grounding. if you don’t constantly babysit the thread and refeed everything, it breaks. persistent memory is surface-level. no real continuity. just isolated task runner. autonomy without continuity is not autonomy

  • Sounds pretty useless

    • must’ve taken years to refine that diagnostic toolkit. meanwhile most of us are stuck tracing emergent behaviour in stochastic models, but yeah, glad you solved it in 3 words.

Re context and overthinking:

> One thing I noticed from early access: if you don’t give it enough context, it does tend to overthink.

I agree with this – that being said, I find that simply asking at the end of a prompt "Do you need any clarifications before you continue?" does a pretty good job at helping AI pin down details as well.

  • The first thing I asked was along the lines of "Find me an implementation of X that does not depend on framework Y, because it's not maintained". o3 thought for a minute and then returns a repository that depends on Y. In its defense, 4.1 also found the same repo and commented that it "has a light dependency", which was cute, if unhelpful.

    Adding a "think about whether your answer actually depends on Y" fixed it, and the response was pretty well thought out. My only gripe is the current persona seems to be tech-bro-informal. It opened with "TL;DR" and included a lot of developer slang that reads like it's trying to be cool in a way that masks overconfidence.

In a world where LLMs can write code fairly well and make use of browsers, I'm not sure if MCP is truly the "USB-C port of AI applications."

The more MCP tools I expose to the LLM, the harder it becomes for the LLM to get the job done. Instead, a single run_python tool works much better and faster. This is especially true for the reasoning models where context matters more.
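
Concretely, the entire tool surface can be one function schema plus an executor, instead of dozens of MCP tool definitions (a sketch in the OpenAI function-calling format; the sandboxing is deliberately hand-waved):

    # One catch-all tool: the model writes Python, we run it and return whatever it prints.
    # A real deployment needs a proper sandbox; bare exec() is only for illustration.
    import contextlib
    import io

    RUN_PYTHON_TOOL = {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return anything it prints to stdout.",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }

    def run_python(code: str) -> str:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})          # do NOT do this outside a sandbox
        return buf.getvalue()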

o3 pro seems to be good with meta-prompting. Meaning, when you ask it to create a prompt for you. In particular it seems to be more concise than o3 when doing this.

Has anybody else noticed this?

  • Do you have an example?

    I'm trying to understand what you mean by meta prompting.

    • Literally just asking it to write a prompt to ask itself (or maybe another LLM) to do something.

      Example: You want to improve a piece of code that interprets bytecode and translates it into native 8051 assembly. You have the code, but you don't really have any ideas on how to improve it. You want the LLM to come up with ideas.

      Without meta prompting, your prompt would be like this:

      "Here's a piece of code that interprets bytecode and translates it into native 8051 assembly. Think of ways to improve it."

      With meta prompting, your initial prompt would be like this:

      "Here's a piece of code that interprets bytecode and translates it into native 8051 assembly. My goal is to improve it. Don't think about the improvements, though.

      Instead, write a large language model prompt that will guide the LLM on ways of thinking about possible improvements. Incorporate as many methods as you can think of."

      The o3 would then come back with a prompt like this:

      ----------

      Here's a piece of code that interprets bytecode and translates it into native 8051 assembly. I'd like you to think of some ways to improve it. Here are some guidelines about what you should and shouldn't do:

      - Describe different models of optimization (e.g., peephole, mid-level IR transformation, architecture-aware instruction packing) that might apply to this code.

      - Provide brief explanations behind each improvement idea, optionally citing known compiler or assembler optimization strategies.

      - Use critical code analysis strategies such as abstraction refactoring, loop unrolling evaluation, and inline expansion assessment.

      - Suggest external tools, static analysis techniques, or benchmarks that would support an empirical review of the code’s performance.

      - Draft a structured list of investigative questions I should ask to uncover hidden inefficiencies or architecture mismatches.

      - Consider ways to minimize register swaps and optimize instruction selection for the 8051 architecture.

      - Try breaking your ideas into perspectives—a performance-focused one, a maintenance-focused one, and a correctness-focused one.

      - Emphasize reasoning paths: “Consider factors like X, contrast options A and B…”

      - Do not directly suggest code changes—instead, construct a high-quality LLM prompt I can run that will guide another model into doing this with rich and varied insights.

      - Step into the shoes of different roles: a compiler designer, a security auditor, and a systems programmer. Think from each of their perspectives.

      - Don’t just list ideas—group them by impact level and feasibility, and reflect on trade-offs.

      - Don’t rewrite the entire architecture. Focus on micro-optimizations and structural improvements that retain the current functional design.

      - Evaluate whether parts of the bytecode can be cached or memoized to avoid reprocessing.

      - Analyze for common performance bottlenecks typical in interpreters, such as inefficient branching or redundant memory access.

      - Avoid superficial ideas like "add comments" or "modernize syntax" unless critically justified.

      ----------

      As you can see, the meta-prompt is essentially an instruction to "compile" a shorter prompt into something much larger and more detailed - which may be wrong, but if the meta-prompt basically encodes well-known methodology (like above), it's hard for it to go wrong.

      (Yes, I did use an LLM to generate the list above - I checked each bullet point by hand just to make sure they weren't nonsensical.)
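
      If you drive this through the API rather than the chat UI, it's just two chained calls: one to "compile" the prompt, one to run it (a sketch; the model name and file name are placeholders):

          # Meta-prompting as two chained API calls: generate the prompt, then run it.
          from openai import OpenAI

          client = OpenAI()

          def ask(prompt: str, model: str = "o3") -> str:   # model is a placeholder
              resp = client.chat.completions.create(
                  model=model, messages=[{"role": "user", "content": prompt}]
              )
              return resp.choices[0].message.content

          code = open("bytecode_to_8051.py").read()          # hypothetical file

          meta_prompt = (
              f"Here's a piece of code that translates bytecode to 8051 assembly:\n\n{code}\n\n"
              "My goal is to improve it. Don't think about the improvements themselves; "
              "instead, write an LLM prompt that will guide a model through possible "
              "improvements, incorporating as many analysis methods as you can."
          )
          generated_prompt = ask(meta_prompt)           # step 1: "compile" the prompt
          ideas = ask(f"{generated_prompt}\n\n{code}")  # step 2: run the compiled prompt
          print(ideas)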

> On the other, we have gigantic, slow, expensive, IQ-maxxing reasoning models that we go to for deep analysis (they’re great at criticism), one-shotting complex problems, and pushing the edge of pure intelligence.

I quite enjoy having an LLM write much of my tedious code these days, but comments like this are just bizarre to me. Can someone share a text question that I can ask an expensive slow LLM that will demonstrate “deep analysis” or “iq-maxxing” on any topic? Whenever I ask them factual or discussion questions I usually get something riddled with factual errors or just tedious, like reading an essay someone wrote for school.

  • I use o3 for my PhD math research. When I am facing a specific problem and I am out of ideas, I pass it to o3. It will usually say something with a fair number of errors and eventually claim to have solved my problem in a standard manner, which it almost never does. But that does not mean it is not useful to me. My attention is like a flashlight illuminating a tiny spot in the possibly vast field of methods I could try. Right now my head is full of dispersive PDEs, so I will not think of using parabolic regularization. But o3 is more of a dim background light. I am in the end better at using any particular technique that is familiar to me than o3, but in this very moment I can only think of a few options. Sometimes my specific problem is actually naturally tackled by a method I have not considered, and o3 suggests it. Whether you consider that IQ-maxxing or not, in this moment, for me, it is, because it helps me.

    • You should also try o4-mini-high. Or, if you have already, I’m curious to hear how they compare for you. I somewhat suspect that o4-mini is better on pure math problems that take more thinking and less world knowledge.

      1 reply →

  • I ran into a weird joystick bug the other week, and I wanted ChatGPT to figure out the exact code flow of how a specific parameter is set.

    I had it analyze different related libraries, and it zeroed in on SDL and Wine codebases, and it found the exact lines of code related to the logic error in Winebus.

    It really helps me dig deep for certain hard to track bugs.

    • I really like using o3 to help with thorny architecture problems by researching existing related solutions on the internet, distilling them, and comparing trade-offs with me

  • The one I asked o3-pro yesterday was "Research the annual smoking tobacco production in Soviet Union 1939-1958 and plot it in a graph versus male population size"

    • validating the info it gives in a response to a question like this sounds like it would be extremely tedious, unless you already had a hand-curated data set to answer it.

      did you? did the data match?

  • I don't have any good idea of what "good" prompts for demonstrating such models would be. But what I would ask such a model is the following. I have no idea if it would fall on its face or not.

    Can you write a version of Chorin's projection method for the Navier-Stokes equations that is both explicit and second order in time?

    Ideally the model should not need a more detailed prompt than this. A first-year grad student in numerical analysis certainly would not.
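
    For context (not an answer), the classical first-order Chorin splitting that the prompt asks to be upgraded reads, in LaTeX:

        u^{*} = u^{n} + \Delta t \left( -(u^{n}\cdot\nabla)\,u^{n} + \nu \nabla^{2} u^{n} \right)   % explicit predictor
        \nabla^{2} p^{n+1} = \frac{1}{\Delta t}\, \nabla\cdot u^{*}                                 % pressure Poisson solve
        u^{n+1} = u^{*} - \Delta t\, \nabla p^{n+1}                                                 % projection / correction

    The catch is that standard second-order variants (e.g. Kim-Moin) typically treat the viscous term implicitly, which is exactly what makes "explicit and second order in time" a non-trivial ask.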

  • Try pasting in a HN thread where people are disagreeing with each other vehemently and ask it for a critique or a breakdown.

    An example from Sonnet 4 'thinking':

    Thread

    * https://imgur.com/aFl9uiA

    This is just a trivial way to illustrate some capability, it is not meant to be deep or insightful or an end-task in itself.

    • this is good enough for me, even if it's not solving your problem. it gives you options and fills some of the information void

>I wrote up all my thoughts, got ratio’ed by @sama

I have no idea what this verb means.

  • If only there was a way to learn the meaning of words....

    Let me chatgpt that for you:

    When someone “gets ratioed,” it means their post has more negative feedback than positive engagement. The number of comments or replies outweighs the number of likes or retweets, indicating the content is controversial.

    When someone says they "got ratioed by <someone>", it usually means that the reply or response from that person received significantly more engagement (likes, retweets, etc.) than the original post.

    It very often happens when a popular account disagrees with your take. When someone with a big following replies to your post, their audience quickly sees it, and if their response is witty, harsh, or widely agreed with, it can snowball into a ratio very fast. Their followers will usually like and boost their reply, sometimes swarming your original post with disagreement or mockery.

seems like the AI models are plateauing, aren't they???

it's just undercutting Gemini by a close margin in terms of capabilities

Xcode and ChatGPT.app are in severe need of better ways to run multiple queries in parallel, operating on the same project (Xcode or whatever other dev tools)