
Comment by danesparza

8 hours ago

Some quotes from the article stand out:

"Claude after working for some time seem to always stop to recap things"

Question: Were you running out of context? That's why certain frameworks like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.

"I've never interacted with Rust in my life"

:-/

How is this a good idea? How can I trust the generated code?

The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.

  • This is the key to the whole thing in my opinion.

    If you ask a coding agent to port code from one language to another and don't have a robust mechanism to test that the results are equivalent, you're inevitably going to waste a lot of time and money on junk code that doesn't work (see the sketch below this thread).

  • Yeah, and he claims a pass rate of 99.96%. At that point you might be running into bugs in the original implementation.
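
A minimal sketch of what that kind of differential harness could look like in Rust. All names here (BattleOutcome, run_reference, run_port) are hypothetical stand-ins; the thread doesn't show the author's actual tooling, and this assumes the `rand` crate for seeding.

    // Differential test sketch: run the same randomly seeded battle through the
    // reference implementation and the Rust port, and record every divergence.
    use rand::{rngs::StdRng, Rng, SeedableRng};

    #[derive(Debug, Clone, PartialEq)]
    struct BattleOutcome {
        winner: u8, // 0 or 1
        turns: u32,
    }

    // Hypothetical stand-in: drive the original engine with this battle seed.
    fn run_reference(seed: u64) -> BattleOutcome {
        unimplemented!("call into the reference implementation")
    }

    // Hypothetical stand-in: drive the new Rust engine with the same seed.
    fn run_port(seed: u64) -> BattleOutcome {
        unimplemented!("call into the Rust port")
    }

    fn main() {
        let mut rng = StdRng::seed_from_u64(42);
        let total: u64 = 2_000_000;
        let mut mismatched_seeds = Vec::new();

        for _ in 0..total {
            let seed: u64 = rng.gen();
            if run_reference(seed) != run_port(seed) {
                // Keep the seed so the divergent battle can be replayed and debugged.
                mismatched_seeds.push(seed);
            }
        }

        let pass_rate =
            100.0 * (total - mismatched_seeds.len() as u64) as f64 / total as f64;
        println!("pass rate: {pass_rate:.2}% ({} mismatches)", mismatched_seeds.len());
    }

The value is less in the pass rate itself than in the saved seeds: each mismatch is a concrete, replayable case to investigate, whether the bug is in the port or in the original.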

I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? Providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system.

  • This only works up to a certain point. Given that the author openly admits they don't know/understand Rust, there is a really high likelihood that the LLM made all kinds of mistakes that a developer who knew the language would have avoided, and the dev is going to be left flailing about trying to understand why they happen/what's causing them/etc. A hand-rewrite would've actually taught the author a lot of very useful things, I'm guessing.

    • It seems like they have something like differential fuzzing to guarantee identical behavior to the original, but they are still left with a codebase they cannot read...

Hopefully they have a test suite written by QA, otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to), then an incremental approach is best.

  • 1 month of Claude Code would be an incremental approach

    It would honestly try to one-shot the whole conversion in a 30-minute autonomous session.

His goal was to get a faster oracle encoding the behavior of Pokemon battles that he could use for a different training project. So this project provides that without itself needing to be maintainable or understandable.

I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked at Oracle.

My answer to this is to often get the LLMs to do multiple rounds of code review (and, depending on the criticality of the code, to do reviews on every commit, though this was clearly a zero-impact hobby project).

They are remarkably good at catching things, especially if you do it on every commit.

  • > My answer to this is to often get the LLMs to do multiple rounds of code review

    So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?

    Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.

    • It sounds to me like you may not have used a lot of these tools yet, because your response sounds like pushback around theoreticals.

      Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).

      AI code reviews are already quite good, and are only going to get better.

    • Implementation -> review cycles are very useful when iterating with CC. The point of the agent reviewer is not to take the place of your personal review, but to catch any low hanging fruit before you spend your valuable time reviewing.

    • Well, you can review its reasoning. And you can passively learn enough about, say, Rust to know if it's making a good point or not.

      Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?

      For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.

      I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.

      Maybe I'll ask 2-3 Claude Code instances to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.

      At no point do I need to go "ok I'll blindly trust this answer".

> How is this a good idea? How can I trust the generated code?

You don't. The LLM wrote the code and it's absolutely right. /s

What could possibly go wrong?

Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.