
Comment by xiphias2

1 day ago

I'm actually excited about somebody experimenting with automated translation, but I'm afraid there will be lots of backwards compatibility issues.

I started looking at the commits, and it's basically solving the ,,tests not pass'' problem by changing the tests themselves. The real work of making it work on programs that are already deployed is only just starting.

The only silver lining I see is that the server-side JS community is, for some reason, already used to breakage all the time.

The whole idea that my RUNTIME contains code that not a single human has looked at does make me uncomfortable, but if this actually works without a ton of issues, it's pretty remarkable.

  • Don't worry, no one reviewed open source code before AI either. Basically nothing changed about the trust model.

    • The speed of the change did. This is the “climate has always been changing” argument climate deniers make. It is a true statement which is still a lie by omission. Climate deniers purposely ignore that the climate has never changed at the current rate, and AI-stans neglect to mention that before AI nobody was merging 1M+ lines of code in one go.

> I started looking at the commits, and it's basically solving the ,,tests not pass'' problem by changing the tests themselves

Not sure if these decisions were made by the LLM, but I've always felt that Claude is more prone to doing "shady stuff" like modifying tests than to finding correct solutions to problems.

GPT/Codex is more honest in this regard.

  • Yeah, Claude is very creative in finding ways of "solving" problems that go against what the user probably intended.

    Having said that, after looking at some of the test changes, they seem to be minor things, like changing timeouts, not changing the actual intended semantics of the tests. But it's too much code to review everything, so I might be completely wrong about that, and in real-world usage, even minor changes like these will cause issues.

I doubt it will end up as a stable release very soon, but I'm happy to be proven wrong. I have some skepticism about this whole rewrite: Jarred Sumner has an enormous internet following, and it feels like an ad.

  • How do you wish to define “ad”, and why does it matter? If I tell you I had lunch, okay, great. If I tell you I had a delicious Coca-Cola with my lunch, sure. If I happen to work at Coca-Cola, does that now become an ad? At what level does it become an issue? And what is the issue?

    • If you work for Coca-Cola, then yeah, there's reason to question your intent, if only because you aren't objective due to your proximity to Coca-Cola.

> solving the ,,tests not pass'' problem by changing the tests themselves

https://github.com/oven-sh/bun/pull/30412/changes/68a34bf8ed...

This is great! Just add a random sleep(1) to a test, don't worry about it, it's going to be fine!

  • On the other hand, the sleep fits the test description better: "should allow reading stdout after a few milliseconds". Even if 1 != 'a few'. It's possible the part of the commit reverted here, https://github.com/oven-sh/bun/commit/a42bf70139980c4d13cc55..., defeated the purpose of the test by removing the sleep. I don't think adding the sleep back is an example of AI cheating.

    Strange test though either way.

  • To be fair the commit message `revert proc.exited change in spawn.test.ts` suggests the sleep was there originally.

I wish I could take a look through the tests to see if anything substantial actually changed, but I can't even get GitHub to load the diffs for me.

> I started looking at the commits, and it's basically solving the ,,tests not pass'' problem by changing the tests themselves. The real work of making it work on programs that are already deployed is only just starting.

Wow. This is definitely quite something.

Can Jarred comment on whether he has read the commits, or respond to your comment? If this turns out to be correct, it has basically made me lose the small faith I had in what Bun is doing.

  • It's OK, we'll see how it goes. He and Anthropic are giving it to us for free, and nowadays just forking the old version is easy if a project needs that. Even maintenance is much easier using LLMs.

    I'm happy it's not a project I'm depending on, but a large enough project had to try this at some point so that we all can learn from how it goes.

    I think this is why Anthropic bought Bun: so that they can sell big code translation as a feature to all the banks with COBOL code that they have wanted to get rid of for a long time.

    Still, those banks / enterprises won't appreciate the number of unit test changes.

    And I agree with another comment that Codex xhigh is much better for these kinds of tasks, but it's still hard at this kind of scale.

  • Jarred has commented on this elsewhere in the thread, basically claiming the parent you replied to is outright lying: it removed no tests and did not meaningfully change annotations to reduce coverage or effectiveness. It added additional tests and made a few changes to hard-coded values due to differences in, for example, how LLVM and Zig handle stack frames.

    The MR is right there, linked at the top of this page. You can check who is telling the truth.

    That said, I don't know how anyone can actually claim to have checked. All day, the size of the MR has made the diff take too long to load, and GitHub dies. I'll have to pull it locally later to check myself.

> it's basically solving the ,,tests not pass'' problem by changing the tests themselves.

False.

0 test files were deleted. 0 pre-existing tests were skipped, todo’d, or had assertions removed. 5 new tests were added in test.skip/test.todo state to track known not-yet-fixed bugs in the port that lacked test coverage before.

The merge changed 28 test files in total.

+1,312 lines

−141 lines

Most of that +1,312 is new tests.

The depth-of-recursion tests for the TOML/JSONC parsers went from 25_000 -> 200_000 because Rust’s smaller stack frames (LLVM lifetime annotations let the optimizer reuse stack slots) mean 25k levels no longer reach the 18 MB stack limit on Windows.
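As a rough sanity check on those numbers (the 18 MB stack and the 25_000 -> 200_000 depths are taken from the comment above; the per-frame figures below are back-of-the-envelope bounds derived from them, not measured values):

```typescript
// Back-of-the-envelope bounds implied by the numbers above.
const stackBytes = 18 * 1024 * 1024; // 18 MB Windows stack

// If 25_000 recursion levels could exhaust the old stack, each
// parser frame had to be at least about:
const oldFrameLowerBound = stackBytes / 25_000; // ≈ 755 bytes/level

// For 200_000 levels to fit in the new (Rust) build, frames must
// now average under about:
const newFrameUpperBound = stackBytes / 200_000; // ≈ 94 bytes/level

console.log(Math.floor(oldFrameLowerBound), Math.floor(newFrameUpperBound)); // prints "754 94"
```

So the claim is plausible only if the Rust frames are roughly an order of magnitude smaller than the old ones, which is exactly what stack-slot reuse would buy.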

  • We're keeping this honest and chill, no worries.

    What is "most of that"?

    Why did you feel the need to produce so much detail about a single category of tests?

  • That's great!

    It's too bad you didn't structure the commits and pull requests a bit differently so that it's easier to review the exact changes, but I hope it goes well.

    For example, doing the test refactorings in a first pull request, and using something like test.xfail that first fails and then, after the merge, succeeds (while the test code itself doesn't change).

    Also, I have seen some tests getting stricter, which again is not a problem, but separating them into a different pull request would have improved reviewability significantly for a runtime that many people and companies depend on.

    I'm sorry you were downvoted by HN and your comment got marked ,,dead''; that's not the way to review things.
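The test.xfail idea suggested above (no such API in bun:test as far as I know, so this is a hypothetical standalone wrapper, not real bun:test code) would behave roughly like this:

```typescript
// Hypothetical xfail wrapper: the test body is expected to fail until
// the port lands. While it fails, the suite stays green; once it starts
// passing, the wrapper reports XPASS so the marker can be removed,
// without the test body itself ever changing across the two PRs.
async function xfail(name: string, body: () => void | Promise<void>): Promise<string> {
  try {
    await body();
  } catch {
    return `XFAIL ${name}`; // still failing, as expected pre-merge
  }
  return `XPASS ${name}: now passes, remove the xfail marker`;
}
```

Before the merge the ported behavior throws, so the test reports XFAIL and the suite stays green; after the merge the same unchanged test reports XPASS, which is the reviewer's cue that the port actually fixed it.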

In tsz [0], 100% of tests pass, yet I have a ton of bugs. I don't think any software out there is really fully tested. I'm experimenting with this idea as well. So far I've learned a ton.

I'm convinced the future of writing code is heavily LLM-assisted.

[0] https://tsz.dev