Comment by ux266478
14 hours ago
Can these claims back themselves up with a study showing that, over a large population with sufficient variety, sourced from a collection of diverse environments, LLM output over time is more reliably correct and free of issues when outputting Rust? Otherwise this is nothing but unempirical conjecture.
Ah, the classic "show me ironclad evidence of this impossible-to-prove-but-quite-clear thing or else you must be wrong!"
Although we did recently get pretty good evidence of those claims for humans, and it would be very surprising if the situation were completely reversed for LLMs (i.e. humans write Rust more reliably but LLMs write C more reliably).
https://security.googleblog.com/2025/11/rust-in-android-move...
I'm not aware of any studies pointing in the opposite direction.
Actually, it's the classic "inductive reasoning has to meet a set of strict criteria to be sound". Criteria which this does not meet: extrapolation from a sample size of one? In a context without any LLM involvement? That's not sound; the conclusion does not follow. The point being: why bother making a statistical generalization? Rust's safety is formally known; deduction over concrete postulates was appropriate.
> it would be very surprising if the situation were completely reversed for LLMs
Lifetimes must be well-defined in safe Rust, and that requires a deep degree of formal reasoning: exactly the kind of complex problem analysis where LLMs are known to produce worse results than humans. Specifically in the context of security vulnerabilities, LLMs produce marginally fewer but significantly more severe issues in memory-safe languages[1]. Still, we might say LLMs will produce safer code with safe Rust, on the basis that 100,000 vibe-coded lines will probably never compile (a sketch of the lifetime reasoning involved follows below).
[1] - https://arxiv.org/html/2501.16857v1
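To make the lifetime point concrete, here is a minimal sketch of what that formal reasoning looks like in practice. It's my own illustrative example (essentially the classic one from the Rust book, not code from the paper above): plausible-looking safe Rust that the borrow checker rejects at compile time.

    // `longest` returns a reference tied to the shorter of the two input
    // lifetimes, so the caller must keep both referents alive while the
    // result is in use.
    fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
        if x.len() > y.len() { x } else { y }
    }

    fn main() {
        let s1 = String::from("long string is long");
        let result;
        {
            let s2 = String::from("xyz");
            // `result` may borrow from `s2`, whose lifetime ends at the
            // inner closing brace...
            result = longest(s1.as_str(), s2.as_str());
        }
        // ...so this use is rejected:
        // error[E0597]: `s2` does not live long enough
        println!("{result}");
    }

An equivalent C program would compile and quietly dangle; getting this past rustc requires actually reasoning about which borrow outlives which scope, which is the burden under discussion.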
I never claimed to be doing a formal proof. If someone said "traffic was bad this morning", would you say "have you done a scientific study on average journey times across the year and for different locations to know that it was actually bad"?
> LLMs are known to produce worse results than humans
We aren't talking about whether LLMs are better than humans.
Also, we're obviously talking about Rust code that compiles. Code that doesn't compile is 100% secure!