
Comment by rafaelmn

18 hours ago

This should be the real benchmark of AI coding skills: how fast do we get the safe/modern infrastructure/tooling that everyone agrees we need but whose development nobody can fund?

If Anthropic wants marketing for Mythos without publishing it - show us a Servo contribution log or something like that. It aligns nicely with their fundamental infrastructure safety goals.

I'd trust that way more than x% increase on y bench.

Hire a core contributor on Servo or Rust, give them unlimited model access, and let's see how far we get with each release.

We do not need vibe-coded critical infrastructure.

  • As I see it, the focus should not be on the coding, but on the testing, and particularly the security evaluation. For critical infrastructure especially, I would want us to have a testing approach so reliable that it wouldn't matter who/what wrote the code.

    • I don't think that will ever be possible.

      At some point security becomes: the program does the thing the human asked it to do, but the human didn't realize they didn't actually want that.

      No amount of testing can fix logic bugs due to bad specification.

      5 replies →

    • I have been thinking about that lately, and aren't testing and security evaluation a way harder problem than designing and carefully implementing new features? I think vibecoding automates the easiest step in software development while making the more challenging/expensive steps harder. How are we supposed to debug complex problems in critical infrastructure if no one understands the code? It's possible that agents will eventually be able to do that, but it feels to me that we are not there yet.

    • I disagree. Thorough testing provides some level of confidence that the code is correct, but there's immense value in having infrastructure which some people understand because they wrote it. No amount of process around your vibe slop can provide that.

      6 replies →

  • >> ...give him unlimited model access

    >We do not need vibe-coded critical infrastructure.

    I think when you have virtually unlimited compute, it affords the ability to really lock down test writing and code review to a degree that isn't possible with normal vibe code setups and budgets.

    That said, for truly critical things, I could see a final human review step for a given piece of generated code, followed by a hard lock. That workflow is going to be popular if it isn't already.

  • If you're trusting core contributors without AI I don't see why you wouldn't trust them with it.

    Hiring a few core devs to work on it should be a rounding error to Anthropic and a huge flex if they are actually able to deliver.

    • It's extremely tempting to write stuff and not bother to understand it, similar to the way most of us don't decompile our binaries and look at the assembly when we write C/C++.

      So, should I trust an LLM as much as a C compiler?

  • They're getting really good at proofs and theorems, right?

    • Proofs/theorems and memory safety vulnerabilities are a special case because there's an easy way to verify whether the model is bullshitting or not.

      That's not true for coding in general. The best you can do is having unreasonably good test coverage, but the vast majority of code doesn't have that.

  • Well if the big players want to tell me their models are nearly AGI they need to put up or shut up. I don't want a stochastically downloaded C compiler. I want tech that improves something.

> show us servo contrib log or something like that

Servo may not be the best project for this experiment, as it has a strict no-AI contributions allowed policy.

The problem with such infrastructure is not the initial development overhead.

It's the maintenance. The long term, slow burn, uninteresting work that must be done continually. Someone needs to be behind it for the long haul or it will never get adopted and used widely.

Right now, at least, LLMs are not great at that. They're great for quickly creating smaller projects. They get less good the older and larger those projects get.

  • I mean, the claim is that next-generation models are better and better at executing over larger contexts. I find that GPT 5.4 xhigh is surprisingly good at analysis even on larger codebases.

    https://x.com/mitchellh/status/2029348087538565612

    Stuff like this where these models are root causing nontrivial large scale bugs is already there in SOTA.

    I would not be surprised if next-generation models can both resolve those more reliably and implement fixes better. At that point they would be sufficiently good maintainers.

    They are suggesting that new models can chain multiple newly discovered vulnerabilities into RCE, privilege escalations, etc. You can't do this without larger-scope planning/understanding, not reliably.

Replicating Chromium as a benchmark? ;)

Replicating Rust would also be a good one. There are many Rust-adjacent languages that ought to exist and would greatly benefit mankind if they were created.

The true solution to this is to fund things that are important, especially when billion-dollar companies are making a fortune from them.

Perhaps, you know, not every thing, especially not every thread on HN, has to be about AI?

I read the link twice and no AI or LLM is mentioned. I don't know why people are so eager to chime in and try to steer the conversation toward AI.