Comment by anonymid

7 days ago

Thanks for reading, and I really appreciate your comments!

> who feed their produced tokens back as inputs, and whose tuning effectively rewards it for doing this skillfully

Ah, this is a great point, and not something that I considered. I agree that the token feedback does change the complexity, and it seems that there's even a paper by the same authors about this very thing! https://arxiv.org/abs/2310.07923

I'll have to think on how that changes things. I think it does take the wind out of the architecture argument as it's currently stated, or at least makes it a lot more challenging. I'll consider myself a victim of media hype on this, as I was pretty sold on this line of argument after reading this article https://www.wired.com/story/ai-agents-math-doesnt-add-up/ and the paper https://arxiv.org/pdf/2507.07505 ... which brushes this off with:

>Can the additional think tokens provide the necessary complexity to correctly solve a problem of higher complexity? We don't believe so, for two fundamental reasons: one that the base operation in these reasoning LLMs still carries the complexity discussed above, and the computation needed to correctly carry out that very step can be one of a higher complexity (ref our examples above), and secondly, the token budget for reasoning steps is far smaller than what would be necessary to carry out many complex tasks.

In hindsight, this doesn't really address the challenge.

My immediate next thought is - even if solutions up to P can be represented within the model / CoT, do we actually feel like we are moving towards generalized solutions, or that the solution space is navigable through reinforcement learning? I'm genuinely not sure where I stand on this.

> I don't have an opinion on this, but I'd like to hear more about this take.

I'll think about it and write some more on this.

This whole conversation is pretty much over my head, but I just wanted to give you props for the way you're engaging with challenges to your ideas!

You seem to have a lot of theoretical knowledge on this, but have you tried Claude or Codex in the past month or two?

Hands-on experience is better than reading articles.

I've been coding for 40 years and after a few months getting familiar with these tools, this feels really big. Like how the internet felt in 1994.

  • I've been developing an AI coding harness https://github.com/dlants/magenta.nvim for over a year now, and I use it (and Cursor and Claude Code) daily at work.

    Fun observation - almost every coding harness (Claude Code, Cursor, Codex) uses a find/replace tool as the primary way of interacting with code. This requires the agent to fully type out the code it's trying to edit, including several lines of context around the edit. This is really inefficient, token-wise! Why does it work this way? Because the LLMs are really bad at counting lines, or using other ways of describing a unique location in the file.

    I've experimented with providing a more robust DSL for text manipulation https://github.com/dlants/magenta.nvim/blob/main/node/tools/... , and I do think it's an improvement over straight search/replace, but the agents tend to struggle a lot - editing the wrong line, messing up the selection state, etc. - which is probably why the major players haven't adopted something like this yet.

    So I feel pretty confident in my assessment of where these models are at!

    And also, I fully believe it's big. It's a huge deal! My work is unrecognizable from what it was even 2 years ago. But that's an impact / productivity argument, not an argument about intelligence. Modern programming languages, IDEs, spreadsheets, etc... also made a fundamental shift in what being a software engineer was like, but they were not generally intelligent.
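For anyone who hasn't poked inside one of these harnesses, the search/replace tool style described above boils down to roughly this (a minimal sketch; `applySearchReplace` is an illustrative name, not any harness's actual API):

```typescript
// Sketch of the search/replace edit pattern most coding harnesses use:
// the model must reproduce the exact existing text (plus enough context)
// so that the match is unique within the file.
function applySearchReplace(file: string, oldText: string, newText: string): string {
  const first = file.indexOf(oldText);
  if (first === -1) throw new Error("old text not found");
  // Require uniqueness: an ambiguous match means the model didn't include
  // enough surrounding context, and we can't know which spot it meant.
  if (file.indexOf(oldText, first + 1) !== -1) throw new Error("old text is not unique");
  return file.slice(0, first) + newText + file.slice(first + oldText.length);
}
```

The uniqueness check is the crux: the model pays for it by typing out enough surrounding context to disambiguate the match, which is exactly where the token inefficiency comes from.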

    • > Fun observation - almost every coding harness (claude code, cursor, codex) uses a find/replace tool as the primary way of interacting with code. [...] Because the LLMs are really bad at counting lines, or using other ways of describing a unique location in the file.

      Incidentally, I saw an interesting article about exactly this subject a while back: it uses line numbers + hashes instead of typing out the full search/replace, writing patches, or a DSL, and it seemed to work really well:

      https://blog.can.ac/2026/02/12/the-harness-problem/
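      As a rough sketch of that line-number-plus-hash idea - address each line by its number plus a short hash of its current content, so a stale or miscounted line number fails loudly instead of silently editing the wrong place (the names and the specific hash here are mine, not the post's):

```typescript
// Tiny non-cryptographic hash (FNV-1a, truncated): not secure, but plenty
// to detect that a line's content has drifted since the model last saw it.
function shortHash(line: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < line.length; i++) {
    h ^= line.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).slice(0, 6);
}

interface LineEdit {
  line: number;    // 1-indexed line to replace
  hash: string;    // expected shortHash of that line's current content
  newText: string; // replacement line
}

function applyLineEdit(file: string, edit: LineEdit): string {
  const lines = file.split("\n");
  const current = lines[edit.line - 1];
  if (current === undefined) throw new Error("line out of range");
  if (shortHash(current) !== edit.hash) {
    throw new Error("hash mismatch: file drifted or line number is wrong");
  }
  lines[edit.line - 1] = edit.newText;
  return lines.join("\n");
}
```

      The appeal is that the agent never has to retype the code it's editing - the hash is a cheap proof that it's pointing at the line it thinks it's pointing at.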

It's general-purpose enough to do web development. How far can you get by writing programs and seeing if you get the answers you intended? If English words are "grounded" by programming, system administration, and browsing websites, is that good enough?