
Comment by josu

3 days ago

>So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.

I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.

- https://highload.fun/tasks/3/leaderboard

- https://highload.fun/tasks/12/leaderboard

- https://highload.fun/tasks/15/leaderboard

- https://highload.fun/tasks/18/leaderboard

- https://highload.fun/tasks/24/leaderboard

All the naysayers here clearly have no idea.

Your large matrix multiplication implementation is quite impressive! I set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it implemented:

    - padding from 2000 to 2048 for easier power-of-two splitting
    - two-level Winograd matrix multiplication with tiled matmul for last level
    - unrolled AVX2 kernel for 64x64 submatrices
    - 64 byte aligned memory
    - restrict keyword for pointers
    - better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)

But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?

  • Thank you. Yeah, I'm doing all of those things, which do get you close to the top. The rest of what I'm doing is mostly micro-optimizations, such as finding a way to avoid the AVX→SSE transition penalty (a 1-2% improvement).

    But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.

    For example, ChatGPT was able to find Matt's blog post about Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...

    Interestingly, it seems that Matt's post is not in the training data of any of the major LLMs.

How are you qualified to judge its performance on real code if you don't know how to write a hello world?

Yes, LLMs are very good at writing code. They are so good at writing code, in fact, that they often generate reams of unmaintainable spaghetti.

When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.

Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.

  • I know what it's like to run a business and build complex systems. That's not the point.

    I used highload as an example because it seems like an objective rebuttal to the claim that "but it can't tackle those complex problems by itself."

    And regarding this:

    "Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash"

    Again, a combination of LLMs/agents with some guidance (from someone with no prior experience in this type of high-performance work) was able to beat every human software developer who has taken these challenges.

  • > Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.

    The skill of "a human software developer" is in fact a very wide distribution, and your statement is true only for an ever-shrinking tail end of it.

  • > How are you qualified to judge its performance on real code if you don't know how to write a hello world?

    The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.

    • What I think people (especially non-coders) get wrong is believing that the limitation of LLMs is building a complex algorithm. In reality, that was solved a long time ago. The real issue is building a product: think microservices across different projects, APIs that are imperfectly documented or whose documentation is massive, and so on.

      Honestly, I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented; the documentation is a bit scattered across pages, but the information is out there. At the time the best model was Opus 4.1, so I used that. After several prompts it had written 1000 lines of non-functional code with zero reusability. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it isn't), and when I told Claude to recode it accordingly, it started writing random stuff that doesn't exist. I built everything myself, functional and reusable, in approximately 300 lines of code.

      The above is a software engineering problem. Reimplementing a JSON parser with Opus is neither fun nor useful, so that should not be used as a metric.


If that is true, then all the commentary about software people keeping their jobs thanks to "taste" and other nice words is just that: commentary. The higher-level stuff still needs someone to learn it (e.g. learning the AVX2 architecture, knowing which tech to work with), but IMO it requires significantly less practice than coding, which in itself was a gate. The skill morphs into being a tech expert rather than a coding expert.

I'm not sure yet what this means for the future of SWEs, though. I don't see higher levels of staff in large businesses bothering to do this, and at some scale I don't see founders wanting to manage all of these agents and processes (they have better things to do at higher levels). But I do see the barrier of learning to code gone, meaning it probably becomes just like any other job.

None of the problems you've shown there are anything close to "very complex computer engineering problems", they're more like "toy problems with widely-known solutions given to students to help them practice for when they encounter actually complex problems".

>I'm currently #1 on 5 different very complex computer engineering problems

Ah yes, well known very complex computer engineering problems such as:

* Parsing JSON objects, summing a single field

* Matrix multiplication

* Parsing and evaluating integer basic arithmetic expressions

And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?

  • Lol, the problem is not finding a solution; the problem is solving it in the most efficient way.

    If you think you can beat an LLM, the leaderboard is right there.