Comment by rurp

5 months ago

I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax, this seemed like a perfect use case to loop in Claude.

The first couple of back-and-forths went OK, but it quickly gave me some SQL that was invalid. I sent back the exact error and line number, and it responded by changing all of the aliases but repeating the same logical error. I tried again, and this time it rewrote more of the code but still used the exact same invalid operation.

At that point I just went ahead and read some docs and other resources and solved things the traditional way.

Given all of the hype around LLMs, I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great, but I still haven't hit a point where they're all that useful.

I'm not convinced LLMs will evolve into general AI. The promise that it's just around the corner feels increasingly like a big scam.

  • I was never on board with it. It feels like the same step change Google was - there was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, didn't know that was possible". It's big, it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.

  • We just need a little more of your money (not mine) to get there. Would I lie to you for $100 billion?

  • Depends what you mean by evolve. I don't think we'll get general AI by simply scaling LLMs, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.

    • Embedding spaces are one thing; LLMs are quite another.

      I believe the former are understandable and likely a part of true AGI, but the latter are a series of hacks, at worst a red herring leading us off the proper track into a dead end.

    • Is there a resource I can use to understand the difference between embedding space and latent space?

  • I mean it’s been a couple of years!

    It may or may not happen, but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.

    • Is it intentional deceit to tell everyone it's leading to something when, as you correctly point out, nobody actually knows if it will?

    • >“scam” means intentional deceit.

      Yes. I'm pretty sure any engineer working on this knows it's not "a few years away". But that doesn't stop product teams from taking advantage of the hype cycle. Hence: "use deception to deprive (someone) of money or possessions".

LLMs' coding performance is directly proportional to the amount of stolen data in the training process. That's why frontend folks are swearing by them and forecasting the dominance of our new god in just a few years: frontend code is mostly just out there; take it and compile it into a dataset. Stuff like SQL databases is not lying on every internet corner and is probably very underrepresented in the dataset, producing inferior performance. Same with rare or systems languages like Rust: LLMs are also very bad with those.

It has made me stop using Google and StackOverflow. I can look most things up quickly without rubber-ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.

I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.

I do think it will continue to rapidly evolve, but it is probably more of a cognitive aid than a replacement. I try to only use it when I am tight on time or need a crutch to help me keep going.

This happens in about one third of my coding interactions with LLMs. I've been trying to get better at handling the situation. At some point it's clear you've explained the problem well enough and the LLM really is just regurgitating the same wrong answer, unable to make progress. It would be useful to spot this as soon as possible.

I enjoy working with very strongly typed languages (Elm, Haskell), and it's hard for me to avoid the "just paste the compile error to the LLM, it only takes a second" trap. At some point (usually around three back-and-forths), if the LLM can't fix the error, it will just generate code with increasingly different compile errors. It's a matter of choosing which one I decide to actually dive into (this is more of a problem with Haskell than Elm, as Elm compile errors are second to none).

  • Honest question -- not trying to be offensive, but what are you using Elm for? Everywhere I've encountered it, it's some legacy system that no one has cared to migrate yet, and it's a complete dumpster fire.

    You spend about three days trying to get it to build, then say fuck it and rewrite it.

    At least, that's the story of the last (and only) three times I've seen Elm code in the wild.

    • I'm not really a frontend developer. I'm using Elm for toy projects; in fact, I did one recently.[0] Elm is my favourite language!

      > You spend about three days trying to get it to build then say fuck it and rewrite it.

      What are the problems you encounter? I can't quite imagine in what way an Elm project could be hard to build! (Also not trying to be offensive, but I almost don't believe you!)

      And into which language do you rewrite those "dumpster fire" Elm codebases?

      [0] https://github.com/tasuki/iso-maze

While I do find LLMs useful, it's mostly for simple and repetitive tasks.

In my opinion, they aren't actually coding anything and have no real understanding. They are simply advanced at searching things and pasting back an answer they scraped online. They can also run simple transformations on those snippets, like renaming variables. But if you tell it there's a problem, it doesn't try to actually solve the problem. It just traverses to the same branch in the tree and tries to give you another similar solution, or, if there's nothing better, it gives you the same solution with maybe a transformation run on it.

So, in short, learn how to code or teach your kids how to code. Because going forward, I think it’s going to be more valuable than ever.

  • > So, in short, learn how to code or teach your kids how to code. Because going forward, I think it’s going to be more valuable than ever.

    Teach your kids how to be resourceful and curious. Coding is just a means to that end. Agreed, though, it's a great one.

I had to do something similar with BigQuery and some open source datasets recently.

I had bad results with Claude, as you mentioned. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors when presented with the error text and more context. I had a similar outcome with 4o.

But I tried the same with o1 and it was consistently much better, with full generations of queries and alterations. I fed it parts of the docs any time it struggled, and it figured things out.

Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.

Maybe the model and the lack of CoT could be part of the challenge you ran into?

  • > and provided bits of the docs.

    At this point I'd ask myself whether I want my original problem solved or if I just want the LLM to succeed with my requested task.

    • Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs.

I do something like this every day at work, lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up.

For what it’s worth, I recently wrote an SQL file that gave an error. I tried to fix it myself and searched the internet but couldn’t solve it. I pasted it into Claude and it solved the error immediately.

I am a paying user of both Claude AI and ChatGPT. I think for the use case you mention, ChatGPT would have done better than Claude. At $20/month I recommend that you try it for the same use case; o1 might have succeeded where Claude failed.

  • Meh. Ballpark, they're very similar. I think people overestimate the differences between the LLMs, similarly to how people overestimate the differences between... people. The difference between the village idiot and Einstein only looks big to us humans. In the grand scale of things, they're pretty similar.

    Now, obviously, LLMs and humans aren't that similar! Different amounts of knowledge, different failure modes, etc.

I now consider all LLMs to be just for giving you a rough shape; you must rewrite it manually, making judgment calls on what to keep.

There is no one "SQL", unfortunately. All of the major database engines have their own dialects and extensions. If you didn't specify which one you were using (Microsoft SQL Server, Oracle, Postgres, SQLite, MySQL), then you didn't give the LLM enough information to work with, same as a junior engineer working blindly with only the information you give them.
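
For instance, something as basic as "give me the five newest rows" is spelled differently across engines (a sketch for illustration; the users table and created_at column are made-up names):

    -- Postgres / MySQL / SQLite
    SELECT * FROM users ORDER BY created_at DESC LIMIT 5;

    -- Microsoft SQL Server
    SELECT TOP 5 * FROM users ORDER BY created_at DESC;

    -- Oracle 12c+ (standard SQL's FETCH FIRST; Postgres accepts it too)
    SELECT * FROM users ORDER BY created_at DESC FETCH FIRST 5 ROWS ONLY;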

  • I left that part out for brevity, but I told Claude the version of Postgres I was using at the start, and even specified that the mistake it produced was invalid in Postgres.

When Claude gets in a loop like this, the best thing to do is just start over in a new window.

When you line up Claude with good context and a good question, it does really well. There are more specialized LLMs for SQL; I would try one of those. Claude is a generalist, and as such it's not great at everything.

It's really good at React and Python -- as someone else mentioned, that junior code is public and available.

However, random SQL needs more "guiding" via the prompt. Explain more about the data and why it's wrong. Tell Claude, "I think you're producing slop" and he will break out of his loop.

Good luck!

I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) are sometimes able to do things that I genuinely don't expect them to be able to do. Of course, that was the case before to a lesser extent already, and I already know that that alone doesn't necessarily translate to usefulness.

So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure modes that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations or knowledge gaps, the resulting code was shockingly decent, and it could at times generate hundreds of lines of code without an obvious error or bug, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of the generated unit tests to be subpar, as it often wrote tests that strongly overlapped with each other and didn't necessarily add value (and rarely worked out of the box anyway, come to think of it).

When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple of occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly there somewhere. It was good at generating decent mundane code, bash scripts, CMake, Bazel, etc., which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it inadvertently suggested a working solution to my problem at the same time (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system).

But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context, use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person, so I haven't kept up with it. Personally, given the state frontier models are in, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.

Even that aside, though, I can see this being useful, especially since Google Search is increasingly unusable.

I do worry, though. If these technologies get better, they will probably make a lot of engineers struggle to develop deep problem-solving skills, since you will need those skills a lot less to get started. Learning to RTFM, dig into code, and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.

It makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any field, however rudimentary, has its own corpus of vocabulary to express ideas and concepts specific to that domain.