Comment by AtlasBarfed
6 days ago
I'd like one to do my test use case:
Port unix-sed from c to java with a full test suite and all options supported.
Somewhere between "it answers questions of life" and "it beats PhDs at math questions", I'd like to see one LLM take this, IMO, rather "pure" language task and succeed.
It is complicated, but it isn't complex. It's string operations with a deep but not that deep expression system and flag set.
It is well described and documented on the internet, and presumably in training sets. It can be stated succinctly, and virtually any programmer would understand what it entails if it were assigned to them. It is drudgery, which gives LLMs the opportunity to show how they would improve true productivity.
GPT fails to do anything other than the most basic substitute operations. Claude was only slightly better, but to its detriment hallucinated massive amounts and made fake passing test cases that didn't even test the code.
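For a sense of scale, here is a rough sketch of just the basic `s` command in Java. It is not taken from any model's output or from the real sed source. The class and method names (`SedSubstitute`, `apply`, `toJavaReplacement`) are made up for illustration, and it assumes the pattern is already in java.util.regex syntax rather than POSIX BRE, that the delimiter is never escaped inside the command, and that only the `g` and `i` flags matter. A real port would need all the other commands, addresses, the hold space, and so on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: handles a single sed-style s/REGEX/REPLACEMENT/FLAGS command.
class SedSubstitute {

    // Applies the command to one line of input and returns the result.
    // Assumptions: REGEX uses java.util.regex syntax (not POSIX BRE), the
    // delimiter is not escaped inside the command, only flags g and i are handled.
    static String apply(String command, String line) {
        if (command.length() < 4 || command.charAt(0) != 's') {
            throw new IllegalArgumentException("only s commands are supported: " + command);
        }
        // sed allows any character after 's' as the delimiter, e.g. s|a|b|
        String delim = Pattern.quote(String.valueOf(command.charAt(1)));
        String[] parts = command.substring(2).split(delim, -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected s/regex/replacement/flags: " + command);
        }
        String flags = parts[2];
        int options = flags.contains("i") ? Pattern.CASE_INSENSITIVE : 0;
        Matcher matcher = Pattern.compile(parts[0], options).matcher(line);
        String replacement = toJavaReplacement(parts[1]);
        // Without the g flag, sed replaces only the first match on the line.
        return flags.contains("g")
                ? matcher.replaceAll(replacement)
                : matcher.replaceFirst(replacement);
    }

    // Translates sed replacement syntax (& and \1..\9) into Java's ($0 and $1..$9),
    // quoting everything else so '$' and '\' are treated literally.
    static String toJavaReplacement(String sedReplacement) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sedReplacement.length(); i++) {
            char c = sedReplacement.charAt(i);
            if (c == '\\' && i + 1 < sedReplacement.length()) {
                char next = sedReplacement.charAt(++i);
                if (Character.isDigit(next)) {
                    out.append('$').append(next);   // back-reference
                } else {
                    out.append(Matcher.quoteReplacement(String.valueOf(next)));
                }
            } else if (c == '&') {
                out.append("$0");                   // whole match
            } else {
                out.append(Matcher.quoteReplacement(String.valueOf(c)));
            }
        }
        return out.toString();
    }
}
```

A test that actually exercises the code would compare calls like `SedSubstitute.apply("s/l+/[&]/g", "hello world")` against what real sed prints for the same input, rather than asserting on hard-coded strings the code never produced.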
The reaction I get to this test is ambivalence, but IMO if LLMs could help port entire software packages between languages with similar feature sets (aside from Turing Completeness), then software cross-use would explode, and maybe we could port "vulnerable" code to "safe" Rust en masse.
I get it, it's not what they are chasing customer-wise. They want to write (in n-gate terms) webcrap.
I have a very simple question, like 5 lines at most, that basically no model, neither reasoning nor simpler, could grasp. For obvious reasons I'm not disclosing it here (because I fear data contamination in the long run), but it basically breaks the "reasoning" of those things. Unfortunately, I still can't try o3-pro because the API version is not easily available, and I'm certainly not willing to pay for it in pro mode, but when it comes to the plus version (if it comes) I'll try. To this date, because of this question (and similar ones), I stand very unimpressed with those models. The marketing is a thousand times larger than reality, and I suspect people in general are surprisingly less capable of detecting intelligence than they think.
The normal o3 also managed to break 3 isolated installations of Linux I was trying it with, a few days ago. The task was very simple: set up Ubuntu with btrfs, timeshift and grub-btrfs. It managed to fail every single time (even when searching the web), so it was not impressive either.
The massive real market here is enterprises that need to rewrite legacy code for modern platforms, retaining the business logic as-is but modernising the style.
.NET Framework 4.x to .NET 10, Python 2 to 3, Java 8 to <current version>, etc...
The advantage the LLMs have here is that staying within the same programming language and its paradigm is dramatically simpler than converting a "procedural" language like C to an object-oriented language like Java that has a wildly different standard library.
How does the latest Gemini 2.5 Pro Ultra Flash Max Hemi XLT release do on that task? It obviously demands a massive context window.
I'll check once the nitrous tanks and the aftermarket turbos overnighted from Japan arrive.