Comment by FeepingCreature

4 days ago

> How are you qualified to judge its performance on real code if you don't know how to write a hello world?

The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.

What I think people get wrong (especially non-coders) is that they believe the limitation of LLMs is to build a complex algorithm. That issue in reality was fixed a long time ago. The real issue is to build a product. Think about microservices in different projects, using APIs that are not perfectly documented or whose documentation is massive, etc.

Honestly I don't know what commenters on hackernews are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented, the documentation is a bit scattered across pages, but the information is out there. At the time there was Opus 4.1, so I used that. It wrote 1000 lines of non-functional code with 0 reusability after several prompts. I then asked something to Chat gpt to see if it was possible without using schedules, it told me yes (even if there is not) and when I told Claude to recode it, it started coding random stuff that doesn't exist. I built everything to be functional and reusable myself, in approximately 300 lines of code.

The above is a software engineering problem. Reimplementing a JSON parser using Opus is not fun nor useful, so that should not be used as a metric

  • > The above is a software engineering problem. Reimplementing a JSON parser using Opus is not fun nor useful, so that should not be used as a metric.

    I've also built a bitorrent implementation from the specs in rust where I'm keeping the binary under 1MB. It supports all active and accepted BEPs: https://www.bittorrent.org/beps/bep_0000.html

    Again, I literally don't know how to write a hello world in rust.

    I also vibe coded a trading system that is connected to 6 trading venues. This was a fun weekend project but it ended up making +20k of pure arbitrage with just 10k of working capital. I'm not sure this proves my point, because while I don't consider myself a programmer, I did use Python, a language that I'm somewhat familiar with.

    So yeah, I get what you are saying, but I don't agree. I used highload as an example, because it is an objective way of showing that a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.

  • This hits the nail on the head. There's a marked difference between a JSON parser and a real world feature in a product. Real world features are complex because they have opaque dependencies, or ones that are unknown altogether. Creating a good solution requires building a mental model of the actual complex system you're working with, which an LLM can't do. A JSON parser is effectively a book problem with no dependencies.

    • You are looking at this wrong. Creating a json parser is trivial. The thing is that my one-shot attempt was 10x slower than my final solution.

      Creating a parser for this challenge that is 10x more efficient than a simple approach does require deep understanding of what you are doing. It requires optimizing the hot loop (among other things) that 90-95% of software developers wouldn't know how to do. It requires deep understanding of the AVX2 architecture.

      Here you can read more about these challenges: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...

  • You need to give it search and tool calls and the ability to test its own code and iterate. I too could not oneshot an interaction layer with Stripe without tools. It also helps to make it research a plan beforehand.