Comment by andrepd
20 hours ago
> Those new fangled doohickeys just aren't reliable
Except they are (unlike a chatbot, a calculator is perfectly deterministic), and the unreliability of LLMs is one of the most widespread targets of criticism against them, if not the most widespread.
Low effort doesn't even begin to describe your comment.
As low effort as your hand-waving away any nuance because it doesn't agree with you?
> Except they are (unlike a chatbot, a calculator is perfectly deterministic)
LLMs are supposed to be stochastic. That is not a bug; I can see why you find that disappointing, but it's just the reality of the tool.
However, as I mentioned elsewhere, calculators also have bugs, and those bugs make their way into scientific research all the time. Floating-point errors are particularly common, as are order-of-operations problems, because physical devices get these wrong frequently and are hard to patch. Worse, they are not SUPPOSED TO BE stochastic, so when they fail nobody notices until it's far too late. [0 - PDF]
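A familiar one-liner illustrates the floating-point rounding issue (shown here in Python, which uses IEEE-754 binary doubles; physical calculators use various internal formats, but the rounding pitfall is analogous):

```python
# 0.1 and 0.2 have no exact binary floating-point representation,
# so their sum picks up a tiny rounding error.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```

A device that silently does this in the middle of a long computation fails exactly the way described above: deterministically wrong, and nobody notices.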
Further, spreadsheets are no better: for example, a scan of ~3,600 genomics papers found that about 1 in 5 had gene-name errors (e.g., SEPT2 → "2-Sep") because that's how Excel likes to format things.[1] Again, this is much worse than a stochastic machine doing its stochastic job, because it's not SUPPOSED to be random; it's just broken, and on a truly massive scale.
[0] https://ttu-ir.tdl.org/server/api/core/bitstreams/7fce5b73-1...
[1] https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al...
That’s a strange argument. There are plenty of stochastic processes that have perfectly acceptable guarantees. A good example is Karger’s min-cut algorithm. You might not know what you get on any given single run, but you know EXACTLY what you’re going to get when you crank up the number of trials.
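To make the Karger's example concrete, here is a toy sketch of the contraction algorithm (my own minimal implementation, not from any particular source). A single run finds the min cut only with probability at least 2/(n(n-1)), but taking the minimum over many independent runs drives the failure probability toward zero, which is exactly the kind of guarantee being described:

```python
import random

def karger_once(edges, rng):
    """One contraction run: merge random edge endpoints until two super-nodes remain."""
    parent = {v: v for e in edges for v in e}

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    remaining = len(parent)
    while remaining > 2:
        u, v = rng.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:            # skip edges already inside one super-node
            parent[ru] = rv
            remaining -= 1
    # The cut size is the number of original edges still crossing the two sides.
    return sum(1 for u, v in edges if find(u) != find(v))

def karger_min_cut(edges, trials, seed=0):
    """Minimum over independent trials; per-trial failure compounds as
    (1 - 2/(n*(n-1)))**trials, so the overall failure probability vanishes."""
    rng = random.Random(seed)
    return min(karger_once(edges, rng) for _ in range(trials))
```

For instance, on two triangles joined by a single bridge edge, any one run may return a larger cut, but a few hundred trials will almost surely return the true min cut of 1.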
Nobody can tell you what you are going to get when you run an LLM once. Nobody can tell you what you're going to get when you run it N times. There are, in fact, no guarantees at all. Nobody even really knows why it can solve some problems and not others, except maybe that it memorized the answer at some point. But this is not how they are marketed.
They are marketed as wondrous inventions that can SOLVE EVERYTHING. This is obviously not true. You can verify it yourself, with a simple deterministic problem: generate an arithmetic expression of length N. As you increase N, the probability that an LLM can solve it drops to zero.
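The experiment is easy to reproduce. A hypothetical sketch (the generator and its parameters are mine): build a nested expression with N operators, compute the ground truth deterministically, then compare the model's answer as N grows.

```python
import random

def random_expr(n_ops, seed=0):
    """Nested arithmetic expression with n_ops binary operators, e.g. ((7 + 3) * 12)."""
    rng = random.Random(seed)
    expr = str(rng.randint(1, 99))
    for _ in range(n_ops):
        op = rng.choice(["+", "-", "*"])
        expr = f"({expr} {op} {rng.randint(1, 99)})"
    return expr

expr = random_expr(20)
ground_truth = eval(expr)  # deterministic reference answer
# Send `expr` to the LLM, compare its reply against ground_truth,
# and track accuracy as n_ops increases.
```

No benchmark suite required; the accuracy curve over N tells the story on its own.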
Ok, fine. This kind of problem is not a good fit for an LLM. But which is? And after you’ve found a problem that seems like a good fit, how do you know? Did you test it systematically? The big LLM vendors are fudging the numbers. They’re testing on the training set, they’re using ad hoc measurements and so on. But don’t take my word for it. There’s lots of great literature out there that probes the eccentricities of these models; for some reason this work rarely makes its way into the HN echo chamber.
Now I’m not saying these things are broken and useless. Far from it. I use them every day. But I don’t trust anything they produce, because there are no guarantees, and I have been burned many times. If you have not been burned, you’re either exceptionally lucky, you are asking it to solve homework assignments, or you are ignoring the pain.
Excel bugs are not the same thing. Most of those problems can be found trivially. You can find them because Excel is a language with clear rules (just not clear to those particular users). The problem with Excel is that people aren’t looking for bugs.
> But I don’t trust anything they produce, because there are no guarantees
> Did you test it systematically?
Yes! That is exactly the right way to use them. For example, when I'm vibe coding I don't ask it to write code. I ask it to write unit tests. THEN I verify that the test is actually testing for the right things with my own eyeballs. THEN I ask it to write code that passes the unit tests.
The same goes even for text formatting. Sometimes I ask it to write a pydantic script to validate text inputs of "x" format. Often, writing the text to specify the format is itself a major undertaking. Then, once the script is working, I ask for the text and tell it to use the script to validate it. After that I know I can expect deterministic results, though it often takes a few tries for it to pass the validator.
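As a toy stand-in for that validator (the commenter used pydantic; here a stdlib regex check plays the same role, and the record format is entirely made up):

```python
import re

# Hypothetical line format: ISO date, amount, ID -- e.g. "2024-05-01,19.99,AB123"
LINE_RE = re.compile(r"\d{4}-\d{2}-\d{2},\d+\.\d{2},[A-Z]{2}\d{3}")

def validate(text):
    """Return the 1-based line numbers that fail the format check."""
    return [i for i, line in enumerate(text.splitlines(), 1)
            if not LINE_RE.fullmatch(line)]
```

The point is the workflow: the validator is deterministic, so you loop the model against it until the output passes, rather than trusting the output on faith.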
You CAN get deterministic results, you just have to adapt your expectations to match what the tool is capable of instead of expecting your hammer to magically be a great screwdriver.
I do agree that the SOLVE EVERYTHING crowd are severely misguided, but so are the SOLVE NOTHING crowd. It's a tool, just use it properly and all will be well.