Comment by amangsingh

14 days ago

A 500k line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare. Right now, they're great for prompting simple sites/platforms but they break at large enterprise repos.

If you don't have a rigid, external state machine governing the workflow, you have to brute-force reliability. That codebase bloat is likely 90% defensive programming: frustration regexes, context sanitizers, tool-retry loops, and state rollbacks, just to stop the agent from drifting or silently breaking things.
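To make concrete the kind of defensive scaffolding I mean, here is a hypothetical sketch in Python (all names are invented, this is not from the actual codebase): a tool call wrapped in a retry loop with external validation, where the inputs are snapshotted per attempt so a failed call can't corrupt them.

```python
# Hypothetical defensive wrapper: retry a tool call until its output passes
# an external validator, treating crashes and bad output the same way.
# run_tool and validate are illustrative callables, not any real harness API.
def call_with_retries(run_tool, validate, args, max_attempts=3):
    last_error = None
    for attempt in range(max_attempts):
        snapshot = dict(args)  # cheap per-attempt "rollback" of the inputs
        try:
            result = run_tool(**snapshot)
            if validate(result):
                return result
            last_error = f"validation failed on attempt {attempt + 1}"
        except Exception as e:  # tool crashed: treat it like drift and retry
            last_error = str(e)
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

Multiply this pattern by every tool and every failure mode and half a million lines stops looking surprising.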

The visual map is great, but from an architectural perspective, we're still herding cats with massive code volume instead of actually governing the agents at the system level.

I find it really strange that there is so much negative commentary on the _code_, but so little commentary on the core architecture.

My takeaway from looking at the tool list is that they got the fundamental architecture right - try to create a very simple and general set of tools on the client-side (e.g. read file, output rich text, etc) so that the server can innovate rapidly without revving the client (and also so that if, say, the source code leaks, none of the secret sauce does).

Overall, when I see this I think they are focused on the right issues, and I think their tool list looks pretty simple/elegant/general. I picture the server team constantly thinking - we have these client-side tools/APIs, how can we use them optimally? How can we get more out of them. That is where the secret sauce lives.
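A hedged sketch of what that client/server split could look like (tool names and shapes here are invented, not Claude Code's actual API): the client ships a tiny, fixed tool registry, and all the cleverness about when and how to invoke those tools lives server-side.

```python
# Illustrative only: a minimal, fixed client-side tool registry. The server
# can change how it composes these tools without the client shipping an update.
from pathlib import Path

TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    "write_file": lambda path, text: Path(path).write_text(text),
    "output_rich_text": lambda text: text,  # stand-in for a rendering tool
}

def dispatch(name, **kwargs):
    # The server names a tool and its arguments; the client only executes.
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

The point is that the registry is boring on purpose: the "secret sauce" is the server-side policy for calling it, which never leaks with the client.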

  • The tools were mostly already known, no? (I wish they had a "present" tool which allowed the model to copy-paste from files/context/etc., showing the user some content without forcing it through the model)

    • Yeah, in fact one thing Claude is freaking great at is decompilation.

      If you can download it client-side, you can likely place a copy in a folder and ask Claude:

      ‘Decompile the app in this folder to answer further questions on how it works. As an example first question, explain what happens when a user does X.’

      I do this with obscure video games where I want a guide to how some mechanics work, e.g. https://pastes.io/jagged-all-69136 as a result of a session.

      It can ruin some games, but despite the possibility of hallucinations I find it waaay more reliable than random internet answers.

      Works for apps too. Obfuscation doesn’t seem to stop it.

  • > but so little commentary on the core architecture.

    The core architecture isn't interesting? It's an LLM TUI; there's not much there to discuss architecturally. The code itself is the actual fascinating train wreck to look at.

  • Why are "tools" for local IO interesting and not just the only way to do it? I can't really imagine a server architecture that gets to read your local files and present them without a fat client of some kind.

    What is the naive implementation you're comparing against? Ssh access to the client machine?

    • It's early days and we don't fully understand LLM behavior to the extent that we can assume questions like this about agent design are resolved. For instance, is an agent smarter with Claude Code's tools or `exec_command` like Codex? And does that remain true for each subsequent model release?

It’s not surprising. There has been quite a bit of industrial research into how to get mere apes to behave deterministically within huge software control systems, and they are an unruly bunch, I assure you.

It's hard to tell how much it says about difficulty of harnessing vs how much it says about difficulty of maintaining a clean and not bloated codebase when coding with AI.

  • Why not both? AI writes bloated spaghetti by default. The control plane needs to be human-written and rigid, at least until the state machine is solid enough to dogfood itself. Then you can safely let the AI enhance the harness from within the sandbox.

We propped the entire economy up on it. Just look at the S&P top 10. Actually, even the top 50 holdings.

If it doesn't deliver on the promise we have bigger problems than "oh no the code is insecure". We went from "I think this will work" to "this has to work because if it doesn't we have one of those 'you owe the bank a billion dollars' situations"

  • It's weird to look at the world like this. If they deliver, doesn't that invalidate thousands of other business plans? What about paying for that?

    If they fail, doesn't software and the giant companies that make it go back to owning the world?

    • “if they deliver”

      As I’m reading this, I’m thinking about how, in 1980, it was imagined that everyone needed to learn how to program in BASIC or COBOL, and that the way computers would become ubiquitous would be that everybody would be writing programs for them. That turned out to be a quaint and optimistic idea.

      It seems like the pitch today is that every company that has a software-like need will be able to use AI to manifest that software into existence, or more generally, to manifest some kind of custom solution into existence. I don’t buy it. Coding the software has never been the true bottleneck, anyone who’s done a hackathon project knows that part can be done quickly. It’s the specifying and the maintenance that is the hard part.

      To me, the only way this will actually bear the fruit it’s promising is if they can deliver essentially AGI in a box. A company will pay to rent some units of compute that they can speak to like a person and describe the needs, and it will do anything - solve any problem - a remote worker could do. IF this is delivered, indeed it does invalidate virtually all business models overnight, as whoever hits AGI will price this rental X%[1] below what it would cost to hire humans for similar work, breaking capitalism entirely.

      [1] X = 80% below on day 1 as they’ll be so flush with VC cash, and they’d plan to raise the price later. Of course, society will collapse before then because of said breaking of capitalism itself.

The time is ripe for deterministic AI; incidentally, this was also released today: https://itsid.cloud/ - presumably will be useful for anyone who wants to quickly recreate an open source Python package or other copyrighted work to change its license.

  • Can you please explain the use here? I tried the demo, and cat, cp, echo, etc... seem to do the exact same thing without the cost.

    Their demo even says:

       `Paste any code or text below. Our model will produce an AI-generated, byte-for-byte identical output.`

    Unless this is a parody site can you explain what I am missing here?

    Token echoing isn't even at the lexeme/pattern level, and not even close to WSD, Ogden's lemma, symbol grounding, etc.

    The intentionally 'probably approximately correct' nature of statistical learning fundamentally limits reproducibility for PAC/statistical methods like transformers.

    Inherent CFG ambiguity == post correspondence problem == halting problem == open-domain frame problem == system identification problem == symbol-grounding problem == Entscheidungsproblem

    The only way to get around that is to construct a grammar that isn't. It will never exist for CFGs, programs, types, etc... with arbitrary input.

    I just don't see why placing a `14-billion parameter identity transformer` that basically echoes tokens is a step forward on what makes the problem hard.

    Please help me understand.

Kinda depends how much of it is vibe coded. It could easily be 5x larger than it needs to be just because the LLM felt like it if they've not been careful.

  • Claude folks proudly claim to have Claude effectively writing itself. The CEO claims it will read an issue and automatically write a fix, tests, commit and submit a PR for it.

  • Bingo. And them 'being careful' is exactly what bloats it to 500k lines. It's a ton of on-the-fly prompt engineering, context sanitizers, and probabilistic guardrails just to keep the vibes in check.

> Right now, they're great for prompting simple sites/platforms but they break at large enterprise repos

Can you expand on this?

My experience is they require excessive steering but do not “break”

  • I think the "breakage" is in terms of conciseness and compactness, not outright brokenness.

    Like that drunk uncle that takes half an hour and 20 000 words to tell you a 500 word story.

Indeed. In some ways, this is just kind of an extrapolation of the overall trend toward extreme bloat that we’ve seen in the past 15 years, just accelerated because LLMs code a lot faster. I’m pretty accustomed to dealing with Web application code bases that are 6-10 years old, where the hacks have piled up on top of other hacks, piled on top of early, tough-to-reverse bad decisions and assumptions, and nobody has had time to go back and do major refactors. This just seems like more of the same, except now you can create a 10 year-old hack-filled code base in three hours.

  • The terrifying thing is that LLMs turn "technical debt" into "synthetic debt" that accumulates in real-time.

    When we use an agent that lacks a native way to consolidate its own context, we essentially force it to generate these 10-year-old hack-filled codebases by design. We’re over-engineering the "container" (the CLI logic) to babysit a "leaky" context.

    If the architecture doesn't start treating long-term memory as a first-class citizen, we’re just going to see more of these 500k-line "safety nets" masking the underlying fragility of the agents.

There seem to be multiple mechanisms compensating for imperfect, lossy memory. "Dreaming" is another band-aid on inability to reliably store memory without loss of precision. How lossy is this pruning process?

It's one thing to give Claude a narrow task with clear parameters, and another to watch errors or incorrect assumptions snowball as you have a more complex conversation or open-ended task.

> they break at large enterprise repos.

I don't know where you get this. You should ask the folks at Meta. They are probably the biggest and happiest users of CC.

You need state-oriented programming to handle that. I know, because I made one. The keyword is "unpredictability". Embrace nondeterminism.

Exactly right. Files on disk as the shared state, not the conversation window. Each step reads current state, does its job, writes output. Next step starts fresh from those files. No accumulated context means no drift, and the LLM can hallucinate in its reasoning all it wants as long as the output passes a check before anything advances.
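A minimal sketch of the files-as-state pattern described above (function and file names are hypothetical): each step loads the state file, does its work, and nothing is persisted unless the output passes an external check.

```python
# Hypothetical files-on-disk pipeline step: disk is the only shared state,
# and state only advances when the output passes an external check.
import json
from pathlib import Path

def run_step(state_dir, step_name, work, check):
    state_file = Path(state_dir) / "state.json"
    # Read current state fresh from disk; no accumulated conversation context.
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    output = work(state)  # the step itself (possibly LLM-backed)
    if not check(output):
        raise ValueError(f"{step_name}: output failed validation, state unchanged")
    state[step_name] = output
    state_file.write_text(json.dumps(state))  # persist only validated output
    return output
```

The model can hallucinate freely inside `work`; only checked output ever becomes state for the next step.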

What do you mean by "actually governing the agents at the system level", and how is it different from "herding cats"?

  • Herding cats is treating the LLM's context window as your state machine. You're constantly prompt-engineering it to remember the rules, hoping it doesn't hallucinate or silently drop constraints over a long session.

    System-level governance means the LLM is completely stripped of orchestration rights. It becomes a stateless, untrusted function. The state lives in a rigid, external database (like SQLite). The database dictates the workflow, hands the LLM a highly constrained task, and runs external validation on the output before the state is ever allowed to advance. The LLM cannot unilaterally decide a task is done.

    I got so frustrated with the former while working on a complex project that I paused it to build a CLI to enforce the latter. Planning to drop a Show HN for it later today, actually.
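A minimal sketch of that kind of DB-governed loop (the table schema and function names are invented, just to illustrate the shape): the database decides what runs and whether it is done; the LLM is a stateless function whose output cannot advance state without passing validation.

```python
# Hypothetical SQLite-governed task loop: the DB dictates the workflow,
# the LLM is an untrusted stateless function, and state only advances
# after external validation. Schema and names are illustrative.
import sqlite3

def advance(conn, task_id, llm_fn, validate):
    row = conn.execute(
        "SELECT prompt FROM tasks WHERE id = ? AND status = 'pending'",
        (task_id,)).fetchone()
    if row is None:
        return False  # nothing pending: the state machine, not the model, says so
    output = llm_fn(row[0])  # stateless call; no orchestration rights
    if not validate(output):
        return False  # state never advances on unvalidated output
    conn.execute("UPDATE tasks SET status = 'done', output = ? WHERE id = ?",
                 (output, task_id))
    conn.commit()
    return True
```

Note the model cannot mark its own task done; only the validator gates the `UPDATE`.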

Thousands of developers are using Claude Code successfully (I think?).

So what specifically is the gripe? If it works, it works right?

So this is more like an art than science - and Claude Code happens to be the best at this messy art (imo).

> A 500k line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare.

Considering what the entire system ends up being capable of, 500k lines is about 0.001% of what I would have expected something like that to require 10 years ago.

You can combine that with all the training and inference code, and at the end of the day, a system that literally writes code ends up being smaller than the LibreOffice codebase.

It boggles the mind, really.

  • > You can combine that with all the training and inference code, and at the end of the day, a system that literally writes code ends up being smaller than the LibreOffice codebase.

    You really need to compare it to the model weights though. That’s the “code”.

    • >You really need to compare it to the model weights though

      Then you'd need to compare the education of any developer to the LOC of their IDE. That's the "code".

      So yea, the analogy doesn't make a whole lot of sense.

  • ... what are you even talking about? "The system that literally writes code" has a few hundred trillion parameters. How is this smaller than LibreOffice?

    I know xkcd 1053, but come on.

Brute-forcing pattern-matching at scale. These are brittle systems with enormous amounts of duct tape holding everything together. Workarounds on workarounds.

[flagged]

    If writing concise architectural analysis without the fluff makes me an AI, I'll take the compliment. But no - just a tired architect who has spent way too many hours staring at broken agent state loops haha.

    • This reply is quite literally AI as well, and so was your initial comment. It's so so obvious after spending enough time on Twitter and seeing the pattern used by all the AI reply bots. Absolutely insane that the HN crowd isn't able to see this.

>A 500k line codebase for an agent CLI proves one thing: making a probabilistic LLM behave deterministically is a massive state-management nightmare. Right now, they're great for prompting simple sites/platforms but they break at large enterprise repos.

Is that the case? I'm pretty sure Claude Code is one of the most massively successful pieces of software made in the last decade. I don't know how that proves your point. Will this codebase become unmanageable eventually? Maybe, but literally every agent harness out there is just copying their lead at this point.

  • Claude Code is a massively successful generator (I use it all the time), but it's not a governance layer.

    The fact that the industry is copying a 500k-line harness is the problem. We're automating security vulnerabilities at scale because people are trying to put the guardrails inside the probabilistic code instead of strictly above it.

    Standardizing on half a million lines of defensive spaghetti is a huge liability.

    • >Standardizing on half a million lines of defensive spaghetti is a huge liability.

      Again, maybe it will be. Or maybe the way we make software and what is considered good practice will completely change with this new technology. I'm betting on the latter at this point.