
Comment by codingwagie

9 months ago

I just used o3 to design a distributed scheduler that scales to 1M+ schedules a day. It was perfect, and did better than two weeks of thought around the best way to build this.
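For scale context, 1M+ schedules a day averages out to roughly 12 dispatches per second, which a single database-backed polling loop can keep up with. The sketch below is not the commenter's actual design; the table layout, column names, and batch size are assumptions chosen for illustration, and a multi-worker deployment would want real row locking (e.g. Postgres's `SELECT ... FOR UPDATE SKIP LOCKED`) rather than SQLite.

```python
# Minimal single-process sketch of a DB-backed scheduler core loop (illustrative only).
import sqlite3
import time

def init_db(path="scheduler.db"):
    """Create the jobs table with an index on (claimed, due_at) for cheap polling."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id      INTEGER PRIMARY KEY,
            due_at  REAL    NOT NULL,            -- unix timestamp when the job should run
            payload TEXT    NOT NULL,            -- opaque job description
            claimed INTEGER NOT NULL DEFAULT 0   -- 0 = pending, 1 = picked up by a worker
        )""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_due ON jobs (claimed, due_at)")
    conn.commit()
    return conn

def poll_once(conn, batch=100):
    """Fetch up to `batch` due jobs and mark them claimed in one commit."""
    now = time.time()
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM jobs WHERE claimed = 0 AND due_at <= ? "
            "ORDER BY due_at LIMIT ?",
            (now, batch),
        ).fetchall()
        conn.executemany(
            "UPDATE jobs SET claimed = 1 WHERE id = ?",
            [(job_id,) for job_id, _ in rows],
        )
    return rows

if __name__ == "__main__":
    conn = init_db(":memory:")
    conn.execute(
        "INSERT INTO jobs (due_at, payload) VALUES (?, ?)",
        (time.time() - 1, '{"task": "send_email"}'),
    )
    conn.commit()
    for job_id, payload in poll_once(conn):
        print("dispatching", job_id, payload)
```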

You just asked it to design or implement?

If o3 can design it, that means it’s using open source schedulers as reference. Did you think about opening up a few open source projects to see how they were doing things in those two weeks you were designing?

  • Why would I do that kind of research if it can identify the problem I am trying to solve and spit out the exact solution? Also, it was a rough implementation adapted to my exact tech stack.

    • Because down that path lies skill atrophy.

      AI research has a thing called "the bitter lesson" - which is that the only thing that works is search and learning. Domain-specific knowledge inserted by the researcher tends to look good in benchmarks but compromises the performance of the system [0].

      The bitter-er lesson is that this also applies to humans. The reason why humans still outperform AI on lots of intelligence tasks is because humans are doing lots and lots of search and learning, repeatedly, across billions of people. And have been doing so for thousands of years. The only uses of AI that benefit humans are ones that allow you to do more search or more learning.

      The human equivalent of "inserting domain-specific knowledge into an AI system" is cultural knowledge, cliches, cargo-cult science, and cheating. Copying other people's work only helps you, long-term, if you're able to build off of that into something new; and lots of discoveries have come about from someone just taking a second look at what had been considered to be generally "known". If you are just "taking shortcuts", then you learn nothing.

      [0] I would also argue that the current LLM training regime is still domain-specific knowledge, we've just widened the domain to "the entire Internet".


    • Because as far as you know, the "rough implementation" only works in the happy path and there are really bad edge cases that you won't catch until they bite you, and then you won't even know where to look.

      An open source project wouldn't have those issues (someone at least understands all the code, and most edge cases have likely been ironed out), plus you then get maintenance updates for free.


    • I was pointing out that if you spent 2 weeks trying to find the solution but AI solved it within a day (you don't specify how long the AI's final solution took), it sounds like those two weeks were not spent very well.

      I would be interested in knowing what in those two weeks you couldn’t figure out, but AI could.


    • Who hired you and why are they paying you money?

      I don't want to be a hater, but holy moley, that sounds like the absolute laziest possible way to solve things. Do you have training, skills, knowledge?

      This is an HN comment thread and all, but you're doing yourself no favors. Software professionals should offer their employers some due diligence and deliver working solutions that at least they understand.

  • Yeah, unless you have very specific requirements, I think the baseline here is not building/designing it yourself but setting up an off-the-shelf commercial or OSS solution, which I doubt would take two weeks...

    • Dunno, in work we wanted to implement a task runner that we could use to periodically queue tasks through a web UI - it would then spin up resources on AWS and track the progress and archive the results.

      We looked at the existing solutions, and concluded that customizing them to meet all our requirements would be a giant effort.

      Meanwhile I fed the requirement doc into Claude Sonnet, and with about 3 days of prompting and debugging we had a bespoke solution that did exactly what we needed (roughly the shape sketched below).

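      As a rough sense of how small the core of such a bespoke runner can be, here is a sketch under assumed names: the enqueue/worker_step/launch_on_aws functions and the QUEUED/RUNNING/ARCHIVED states are illustrative inventions, not the poster's actual code, and the AWS launch is stubbed out where a real version might call boto3 and archive results to S3.

      ```python
      # Tiny task-runner sketch: a web handler enqueues a task, a worker
      # advances it through QUEUED -> RUNNING -> ARCHIVED.
      import json
      import queue
      import uuid

      tasks = {}                  # task_id -> task record (a real system would use a database)
      work_queue = queue.Queue()  # a real system would use SQS, Celery, etc.

      def enqueue(requirements: dict) -> str:
          """What the web UI handler would do: record the task and queue it."""
          task_id = str(uuid.uuid4())
          tasks[task_id] = {"state": "QUEUED", "requirements": requirements, "result": None}
          work_queue.put(task_id)
          return task_id

      def launch_on_aws(requirements: dict) -> str:
          """Stub for spinning up AWS resources and running the job there."""
          return json.dumps({"ok": True, "echo": requirements})

      def worker_step():
          """One worker iteration: run the next queued task and archive its result."""
          task_id = work_queue.get()
          task = tasks[task_id]
          task["state"] = "RUNNING"
          task["result"] = launch_on_aws(task["requirements"])
          task["state"] = "ARCHIVED"   # a real version would upload the result and keep metadata

      if __name__ == "__main__":
          tid = enqueue({"job": "nightly-report", "instance_type": "t3.medium"})
          worker_step()
          print(tid, tasks[tid]["state"], tasks[tid]["result"])
      ```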

While impressive, I'm not convinced that improved performance on tasks of this nature is indicative of progress toward AGI. Building a scheduler is a well-studied problem space. Something like the ARC benchmark is much more indicative of progress toward true AGI, but probably still insufficient.

  • The point is that AGI is the wrong bar to aim for. LLMs are sufficiently useful in their current state that even if it takes us 30 years to get to AGI, the incremental improvements between now and then will still provide enough value to users/customers for some companies to win big. VC funding will run out and some companies won't make it, but some of them will, to the delight of their investors. "AGI when?" is an interesting question, but it might just be academic. We have self-driving cars, weight loss drugs that work, reusable rockets, and useful computer AI. We're living in the future, man, and robot maids are just around the corner.

  • The other models failed at this miserably. There were also specific technical requirements I gave it related to my tech stack.

“It does something well” ≠ “it will become AGI”.

Your anecdotal example isn't any more convincing than “This machine cracked Enigma's messages in less time than an army of cryptanalysts over a month, surely we're gonna reach AGI by the end of the decade” would have been.

I find now that I quickly bucket people into "have not/have barely used the latest AI models" or "trolls" when they express a belief that current LLMs aren't intelligent.

  • You can put me in that bucket then. It's not true, though: I've been working with AI almost daily for 18 months, and I KNOW it's nowhere close to being intelligent. It doesn't look like your buckets are based on truth so much as on appeal. I disagree with your assessment, so you think I don't know what I'm talking about. I hope you can understand that other people who know just as much as you (or even more) can disagree without being wrong or uninformed. LLMs are amazing, but they're nowhere close to intelligent.

Designing a distributed scheduler is a solved problem, of course an LLM was able to spit out a solution.

  • As noted elsewhere, all other frontier models failed miserably at this.

    • It is unsurprising that some lossily-compressed-database search programs might be worse for some tasks than other lossily-compressed-database search programs.

    • That doesn't mean the one that manages to spit it out of its latent space is close to AGI. I wonder how consistently that specific model could do it. If you tried 10 LLMs, maybe all 10 of them could have spit out the answer 1 out of 10 times. Correct problem retrieval by one LLM and failure by the others isn't a great argument for near-AGI. But LLMs will be useful in limited domains for a long time.

I’ve had similar things over the last couple days with o3. It was one-shotting whole features into my Rust codebase. Very impressive.

I remember before ChatGPT, smart people would come on podcasts and say we were 100 or 300 years away from AGI.

Then we saw GPT shock them. The reality is these people have no idea, it’s just catchy to talk this way.

With the amount of money going into the problem and the linear increases we see over time, it’s much more likely we see AGI sooner than later.

I'm not sure what your point is in the context of the AGI topic.

  • I'm a tenured engineer who spent a long time at FAANG, and I was casually beaten this morning by a far superior design from an LLM.

    • Is this because the LLM actually reasoned its way to a better design, or because it found a better design in its "database", scoured from another tenured engineer?
