
Comment by danesparza

8 hours ago

Some quotes from the article stand out:

"Claude after working for some time seem to always stop to recap things"

Question: Were you running out of context? That's why certain frameworks like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.

"I've never interacted with Rust in my life"

:-/

How is this a good idea? How can I trust the generated code?

The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.

  • This is the key to the whole thing in my opinion.

    If you ask a coding agent to port code from one language to another and don't have a robust mechanism to test that the results are equivalent, you're inevitably going to waste a lot of time and money on junk code that doesn't work (see the sketch below this thread).

  • Yeah, and he claims a pass rate of 99.96%. At that point you might be running into bugs in the original implementation.
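
A minimal sketch of what that kind of differential harness could look like in Rust. All names here (BattleOutcome, run_reference, run_port) are hypothetical stand-ins; the thread doesn't show the author's actual tooling, and this assumes the `rand` crate for seeding.

    // Differential test sketch: run the same randomly seeded battle through the
    // reference implementation and the Rust port, and record every divergence.
    use rand::{rngs::StdRng, Rng, SeedableRng};

    #[derive(Debug, Clone, PartialEq)]
    struct BattleOutcome {
        winner: u8, // 0 or 1
        turns: u32,
    }

    // Hypothetical stand-in: drive the original engine with this battle seed.
    fn run_reference(seed: u64) -> BattleOutcome {
        unimplemented!("call into the reference implementation")
    }

    // Hypothetical stand-in: drive the new Rust engine with the same seed.
    fn run_port(seed: u64) -> BattleOutcome {
        unimplemented!("call into the Rust port")
    }

    fn main() {
        let mut rng = StdRng::seed_from_u64(42);
        let total: u64 = 2_000_000;
        let mut mismatched_seeds = Vec::new();

        for _ in 0..total {
            let seed: u64 = rng.gen();
            if run_reference(seed) != run_port(seed) {
                // Keep the seed so the divergent battle can be replayed and debugged.
                mismatched_seeds.push(seed);
            }
        }

        let pass_rate =
            100.0 * (total - mismatched_seeds.len() as u64) as f64 / total as f64;
        println!("pass rate: {pass_rate:.2}% ({} mismatches)", mismatched_seeds.len());
    }

The value is less in the pass rate itself than in the saved seeds: each mismatch is a concrete, replayable case to investigate, whether the bug is in the port or in the original.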

I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? Providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system.

  • This only works up to a certain point. Given that the author openly admits they don't know/understand Rust, there is a really high likelihood that the LLM made all kinds of mistakes that a developer who knew the language would have avoided, and the dev is going to be left flailing about trying to understand why they happen/what's causing them/etc. A hand-rewrite would've actually taught the author a lot of very useful things, I'm guessing.

    • It seems like they have something like differential fuzzing to guarantee identical behavior to the original, but they are still left with a codebase they cannot read...

Hopefully they have a test suite written by QA, otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to), then an incremental approach is best.

  • 1 month of Claude Code would be an incremental approach

    It would honestly try to one-shot the whole conversion in a 30-minute autonomous session.

His goal was to get a faster oracle encoding the behavior of Pokemon battles that he could use for a different training project. So this project provides that without itself needing to be maintainable or understandable.

I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked at Oracle.

My answer to this is to often get the LLMs to do multiple rounds of code review (and, depending on the criticality of the code, to do reviews on every commit, though this was clearly a zero-impact hobby project).

They are remarkably good at catching things, especially if you do it on every commit.

  • > My answer to this is to often get the LLMs to do multiple rounds of code review

    So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?

    Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.

    • It sounds to me like you may not have used a lot of these tools yet, because your response sounds like pushback around theoreticals.

      Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).

      AI code reviews are already quite good, and are only going to get better.

    • Implementation -> review cycles are very useful when iterating with CC. The point of the agent reviewer is not to take the place of your personal review, but to catch any low hanging fruit before you spend your valuable time reviewing.

    • Well, you can review its reasoning. And you can passively learn enough about, say, Rust to know if it's making a good point or not.

      Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?

      For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.

      I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.

      Maybe I'll ask 2-3 Claude Code instances to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.

      At no point do I need to go "ok I'll blindly trust this answer".

> How is this a good idea? How can I trust the generated code?

You don't. The LLM wrote the code and it's absolutely right. /s

What could possibly go wrong?

Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.