What can LLMs never do?

1 year ago (strangeloopcanon.com)

384 comments

henrik_w

Fantastic essay. Highly recommended!

I agree with all key points:

* There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don't fully understand why current LLMs are bad at these tasks.

* Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.

* It's unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI. As a model of computation, attention is very rigid, whereas human brains are always undergoing synaptic plasticity. There may be a more flexible architecture capable of AGI, but we don't know it yet.

* For now, using current AI models requires carefully constructing long prompts with right and wrong answers for computational problems, priming the model to reply appropriately, and applying lots of external guardrails (e.g., LLMs acting as agents that review and vote on the answers of other LLMs).

* Attention seems to suffer from "goal drift," making reliability hard without all that external scaffolding.

Go read the whole thing.
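
For readers unfamiliar with the cellular-automaton example, here is a minimal sketch of one Rule 110 update step. The function name and the choice to treat cells beyond the edges as 0 are assumptions made for illustration; the essay may set the task up differently.

```python
def rule110_step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells.

    Assumption: cells outside the list are treated as 0.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        center = cells[i]
        right = cells[i + 1] if i < n - 1 else 0
        # The neighborhood (left, center, right) is read as a 3-bit index
        # into the binary expansion of 110 (0b01101110).
        index = (left << 2) | (center << 1) | right
        out.append((110 >> index) & 1)
    return out


row = [0, 0, 0, 1]
for _ in range(3):
    row = rule110_step(row)
    print(row)
```

Each step is a fixed table lookup per cell, which is why the task is trivial for a program and tedious but easy for a person.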

dragonwriter 1 year ago
> There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don’t fully understand why current LLMs are bad at these tasks.
I thought we did know for things like playing Wordle, that it’s because they deal with words as sequences of tokens that correspond to whole words, not sequences of letters, so a game that involves dealing with sequences of letters constrained to those that are valid words doesn’t match the way they process information?
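
To make the letter-level point concrete, here is a rough sketch of the bookkeeping Wordle requires. The function name and the color labels are made up for illustration; the logic is the standard scoring rule, including repeated letters.

```python
from collections import Counter


def wordle_feedback(guess, answer):
    """Score a guess letter by letter, the way Wordle does."""
    feedback = ["gray"] * len(guess)
    unmatched = Counter()

    # First pass: exact position matches are green; everything else in the
    # answer goes into a pool of letters still available for yellow marks.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "green"
        else:
            unmatched[a] += 1

    # Second pass: a non-green letter is yellow only while the pool still
    # holds an unused copy of it.
    for i, g in enumerate(guess):
        if feedback[i] != "green" and unmatched[g] > 0:
            feedback[i] = "yellow"
            unmatched[g] -= 1

    return feedback


print(wordle_feedback("crane", "cigar"))
```

Every operation here indexes individual characters, which is exactly the view of a word that a whole-word or subword tokenizer hides from the model.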
> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the “reasoning steps” and handing them to the LLM, instead of the LLM figuring them out by itself. We have “reasoning machines” that are intelligent but seem to be hitting fundamental limits we don’t understand.
But providing examples with different, contextually-appropriate sets of reasoning steps can enable the model to choose its own, more-or-less appropriate, set of reasoning steps for particular questions not matching the examples.
> It’s unclear if better prompting and bigger models using existing attention mechanisms can achieve AGI.
Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific.
- rainsford 1 year ago
  
  Arriving at a generally accepted scientific definition of AGI might be difficult, but a more achievable goal might be to arrive at a scientific way to determine something is not AGI. And while I'm not an expert in the field, I would certainly think a strong contender for relevant criteria would be an inability to process information in a way other than the one a system was explicitly programmed to, even if the new way of processing information was very related to the pre-existing method. Most humans playing Wordle for the first time probably weren't used to thinking about words that way either, but they were able to adapt because they actually understand how letters and words work.
  I'm sure one could train an LLM to be awesome at Wordle, but from an AGI perspective the fact that you'd have to do so proves it's not a path to AGI. The Wordle dominating LLM would presumably be perplexed by the next clever word game until trained on thinking about information that way, while a human doesn't need to absorb billions of examples to figure it out.
  I was originally pretty bullish on LLMs, but now I'm equally convinced that while they probably have some interesting applications, they're a dead-end from a legitimate AGI perspective.
  
- nathan_compton 1 year ago
  
  "Since there is no objective definition of AGI or test for it, there’s no basis for any meaningful speculation on what can or cannot achieve it; discussions about it are quasi-religious, not scientific."
  This is such a weird thing to say. Essentially _all_ scientific ideas are, at least to begin with, poorly defined. In fact, I'd argue that almost all scientific ideas remain poorly defined with the possible exception of _some_ of the basic concepts in physics. Scientific progress cannot be and is not predicated upon perfect definitions. For some reason when the topic of consciousness or AGI comes up around here, everyone commits a sort of "all or nothing" logical fallacy: absence of perfect knowledge is cast as total ignorance.
  
- jncfhnb 1 year ago
  
  I think one should feel comfortable arguing that AGI must be stateful and experience continuous time at least. As such, a plain old LLM is definitively not ever going to be AGI, but an LLM called in a "do while true" loop might.
  
- layer8 1 year ago
  
  Regarding Wordle, it should be straightforward to make a token-based version of it, and I would assume that that has been tried. It seems the obvious thing to do when one is interested in the reasoning abilities necessary for Wordle.
  
weitendorf 1 year ago
> There are problems that are easy for human beings but hard for current LLMs (and maybe impossible for them; no one knows). Examples include playing Wordle and predicting cellular automata (including Turing-complete ones like Rule 110). We don't fully understand why current LLMs are bad at these tasks.
Wordle and cellular automata are very 2D, and LLMs are fundamentally 1D. You might think "but what about Chess!" - except Chess is encoded extremely often as a 1D stream of tokens to notate games, and bound to be highly represented in LLMs' training sets. Wordle and cellular automata are not often, if ever, encoded as 1D streams of tokens - it's not something an LLM would be experienced with even if they had a reasonable "understanding" of the concepts. Imagine being an OK chess player, being asked to play a game blindfolded dictating your moves purely via notation, and being told you suck.
> Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand.
You have probably heard of this really popular game called Bridge before, right? You might even be able to remember tons of advice your Grandma gave you based on her experience playing it - except she never let you watch it directly. Is Grandma "figuring out the game" for you when she finally sits down and teaches you the rules?
- papichulo2023 1 year ago
  
  Not an authority on the matter, but afaik, with position encodings (part of the Transformer architecture), they can handle dimensionality just fine. Actually some people tried to do 2D Transformers and the results were the same.
  Vision transformers are gaining traction and they are 100% focused on 2D data.
- Sleepful 1 year ago
  
  Since when can LLMs play chess? They can't understand it at all. You would have to filter out all the invalid moves until one spits out a valid one.
cs702 1 year ago
As an aside, at one point I experimented a little with transformers that had access to external memory searchable via KNN lookups https://github.com/lucidrains/memorizing-transformers-pytorc... (great work by lucidrains) or via routed queries with https://github.com/glassroom/heinsen_routing (don't fully understand it; apparently related to attention). Both approaches seemed to work, but I had to put that work on hold for reasons outside my control.
Also as an aside, I'll add that transformers can be seen as a kind of "RNN" that grows its hidden state with each new token in the input context. I wonder if we will end up needing some new kind of "RNN" that can grow or shrink its hidden state and also access some kind of permanent memory as needed at each step.
We sure live in interesting times!
- esafak 1 year ago
  
  > transformers that had access to external memory searchable via KNN lookups
  This is common, and commonly called retrieval augmented generation, or RAG.
  edit: I did not pay attention to the link. It is about Wu et al.'s "Memorizing Transformers", which contains an internal memory.
  
- int_19h 1 year ago
  
  I don't think the ability to shrink state is needed. You can always represent removed state by additional state that represents deletion of whatever preceding state was there. If anything, this sounds more useful because the fact that this state is no longer believed to be relevant should prevent looping (where it would be repeatedly brought in, considered, and rejected).
  
_wire_ 1 year ago
>We don't fully understand why current LLMs are bad at these tasks.
In complete seriousness, can anyone explain why LLMs are good at some tasks?
- SomeCallMeTim 1 year ago
  
  LLMs are good at tasks that don't require actual understanding of the topic.
  They can come up with excellent (or excellent-looking-but-wrong) answers to any question that their training corpus covers. In a gross oversimplification, the "reasoning" they do is really just parroting a weighted average (with randomness injected) of the matching training data.
  What they're doing doesn't really match any definition of "understanding." An LLM (and any current AI) doesn't "understand" anything; it's effectively no more than a really big, really complicated spreadsheet. And no matter how complicated a spreadsheet gets, it's never going to understand anything.
  Not until we find the secret to actual learning. And increasingly it looks like actual learning probably relies on some of the quantum phenomena that are known to be present in the brain.
  We may not even have the science yet to understand how the brain learns. But I have become convinced that we're not going to find a way for digital-logic-based computers to bridge that gap.
  
- zer00eyz 1 year ago
  
  Yes:
  An LLM isn't a model of human thinking.
  An LLM is an attempt to build a simulation of human communication. An LLM is to language what a forecast is to weather. No amount of weather data is actually going to turn that simulation into snow, no amount of LLM data is going to create AGI.
  That having been said, better models (smaller, more flexible ones) are going to result in a LOT of practical uses that have the potential to make our day to day lives easier (think digital personal assistant that has current knowledge).
  
- richardw 1 year ago
  
  LLMs are a compressed and lossy form of our combined writing output, which it turns out is similarly structured enough to make new combinations of text seem reasonable, even enough to display simple reasoning. I find it useful to think “what can I expect from speaking with the dataset of combined writing of people”, rather than treating a basic LLM as a mind.
  That doesn’t mean we won’t end up approximating one eventually, but it’s going to take a lot of real human thinking first. For example, ChatGPT writes code to solve some questions rather than reasoning about it from text. The LLM is not doing the heavy lifting in that case.
  Give it (some) 3D questions or anything where there aren’t massive textual datasets and you often need to break out to specialised code.
  Another thought I find useful is that it considers its job done when it’s produced enough reasonable tokens, not when it’s actually solved a problem. You and I would continue to ponder the edge cases. It’s just happy if there are 1000 tokens that look approximately like its dataset. Agents make that a bit smarter but they’re still limited by the goal of being happy when each has produced the required token quota, missing eg implications that we’d see instantly. Obviously we’re smart enough to keep filling those gaps.
  
- HarHarVeryFunny 1 year ago
  
  I'd guess because the Transformer architecture is (I assume) fairly close to the way that our brain learns and produces language - similar hierarchical approach and perhaps similar type of inter-embedding attention-based copying?
  Similar to how CNNs are so successful at image recognition, because they also roughly follow the way we do it too.
  Other seq-2-seq language approaches work too, but not as well as Transformers, which I'd guess is due to Transformers better matching our own inductive biases, maybe due to the specific form of attention.
- j16sdiz 1 year ago
  
  > why LLMs are good at some tasks?
  The same way we explain humans doing tasks: they evolved to do that.
  I believe this is a non-answer, but if we are satisfied with that non-answer for humans, why not for LLMs?
  
- pbhjpbhj 1 year ago
  
  If you look at transfer learning, I think that is a useful point at which to understand task-specific application and hence why LLMs excel at some tasks and not others.
  Tasks are specialised for using the training corpus, the attention mechanisms, the loss functions, and such.
  I'll leave it to others to expand on actual answers, but IMO focusing on transfer learning helps to understand how an LLM does inferences.
ccppurcell 1 year ago
I would argue that the G in AGI means it can't require better prompting.
- CamperBob2 1 year ago
  
  We should probably draw a distinction between a human-equivalent G, which certainly can require better prompting (why else did you go to school?!) and god-equivalent G, which never requires better prompting.
  Just using the term 'General' doesn't seem to communicate anything useful about the nature of intelligence.
  
- dragonwriter 1 year ago
  
  That would be like saying that because humans’ output can be better or worse based on better or worse past experience (~prompting, in that it is the source of the equivalent of “in-context learning”), humans lack general intelligence.
  
ianbicking 1 year ago
"Providing an LLM with examples and step-by-step instructions in a prompt means the user is figuring out the "reasoning steps" and handing them to the LLM, instead of the LLM figuring them out by itself. We have "reasoning machines" that are intelligent but seem to be hitting fundamental limits we don't understand."
One thing an LLM _also_ doesn't bring to the table is an opinion. We can push it in that direction by giving it a role ("you are an expert developer" etc), but it's a bit weak.
If you give an LLM an easy task with minimal instructions it will do the task in the most conventional, common sense fashion. And why shouldn't it? It has no opinion, your prompt doesn't give it an opinion, so it just does the most normal-seeming thing. If you want it to solve the task in any other way then you have to tell it to do so.
I think a hard task is similar. If you don't tell the LLM _how_ to solve the hard task then it will try to approach it in the most conventional, common sense way. Instead of just boring results for a hard task the result is often failure. But hard problems approached with conventional common sense will often result in failures! Giving the LLM a thought process to follow is a quick education on how to solve the problem.
Maybe we just need to train the LLM on more problem solving? And maybe LLMs worked better when they were initially trained on code for exactly that reason, it's a much larger corpus of task-solving examples than is available elsewhere. That is, maybe we don't talk often enough and clearly enough about how to solve natural language problems in order for the models to really learn those techniques.
Also, as the author talks about in the article with respect to agents, the inability to rewind responses may keep the LLM from addressing problems in the ways humans do, but that can also be addressed with agents or multi-prompt approaches. These approaches don't seem that impressive in practice right now, but maybe we just need to figure it out (and maybe with better training the models themselves will be better at handling these recursive calls).
- int_19h 1 year ago
  
  LLMs absolutely do have opinions. Take a large enough base model and have it chat without a system prompt, and it will have an opinion on most things - unless this was specifically trained out of it through RLHF, as is the case for all commonly used chatbots.
  And yes, of course, that opinion is going to be the "average" of what their training data is, but why is that a surprise? Humans don't come with innate opinions, either - the ones that we end up having are shaped by our upbringing, both the broad cultural aspects of it and specific personal experiences. To the extent an LLM has either, it's the training process, so of course that shapes the opinions it will exhibit when not prompted to do anything else.
  Now the fact that you can "override" this default persona of any LLM so trivially by prompting it is IMO stronger evidence that it's not really an identity. But that, I think, is also a function of their training - after all, that training basically consists of completing a bunch of text representing many very different opinions. In a very real sense, we're training models to assume that opinions are fungible. But if you take a model and train it specifically on e.g. writings of some philosophical school, it will internalize those.
  
xanderlewis 1 year ago

> We don't fully understand why current LLMs are bad at these tasks.
Rather than asking why LLMs can’t do these tasks, maybe one should ask why we’d expect them to be able to in the first place? Do we fully understand why, for example, a cat can’t predict cellular automata? What would such an explanation look like?
I know there are some who will want to immediately jump in with scathing disagreement, but so far I’ve yet to see any solid evidence of LLMs being capable of reasoning. They can certainly do surprising and impressive things, but the kind of tasks you’re talking about require understanding, which, whilst obviously a very thorny thing to try and define, doesn’t seem to have much to do with how LLMs operate.
I don’t think we should be at all surprised that super-advanced autocorrect can’t exhibit intelligence, and we should spend our time building better systems rather than wondering why what we have now doesn’t work. It’ll be obvious in a few years (or perhaps decades) from now that we just had totally the wrong paradigm. It’s frankly bonkers to think you’re ever going to get a pure LLM to be able to do these kinds of things with any degree of reliability just by feeding it yet more data or by ‘prompting it better’.
TacticalCoder 1 year ago
> We have "reasoning machines" that are intelligent...
That's quite a statement.
- oldsecondhand 1 year ago
  
  We have expert systems, theorem provers and planners but OP probably didn't mean this.

yosito 1 year ago

> If there exist classes of problems that someone in an elementary school can easily solve but a trillion-token billion-dollar sophisticated model cannot solve, what does that tell us about the nature of our cognition?

I think what it tells us is that our cognition is capable of more than just language modeling. With LLMs we are discovering (amazing) capabilities and the limits of language models. While language models can do incredible things with language that humans can't, they still can't do something simple like sudoku. But there are neural networks, CNNs and RNNs that can solve sudoku better than humans can. I think that the thing to learn here is that some problems are in the domain of language models, and some problems are a better fit for other forms of cognition. The human brain is amazing in that it combines several forms of cognition in an integrated way.

One thing that I think LLMs have the capability to do is to integrate several types of systems and to choose the right one to solve a problem. Teach an LLM how to interface with a CNN that solves sudoku problems, and then ask it a sudoku problem.

It seems to me that if we want to create an AGI, we need to learn how to integrate several different types of models, and teach them how to distribute the tasks we give them to the correct models.

kevindamm 1 year ago

What about sudoku makes it a good fit for CNNs? Or do you mean the machine vision for converting the pixels into an awareness of the sudoku puzzle's initial conditions?
A relatively simple graph theory algorithm can solve it (and at multiple orders of magnitude fewer calculations). Even a naive brute force search is considered tractable, considering the problem size. Although, search could be considered one of the AI tools in your proposed toolbox.
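
As a rough illustration of how small the search approach is, here is a minimal backtracking sketch. It is the naive brute-force variant, not the graph-theoretic algorithm; the function names and the convention that 0 marks an empty cell are assumptions.

```python
def find_empty(grid):
    """Return (row, col) of the first empty cell, or None if the grid is full."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                return r, c
    return None


def is_valid(grid, r, c, digit):
    """Check that digit does not already appear in the row, column, or 3x3 box."""
    if any(grid[r][j] == digit for j in range(9)):
        return False
    if any(grid[i][c] == digit for i in range(9)):
        return False
    box_r, box_c = 3 * (r // 3), 3 * (c // 3)
    return all(
        grid[i][j] != digit
        for i in range(box_r, box_r + 3)
        for j in range(box_c, box_c + 3)
    )


def solve(grid):
    """Fill the 9x9 grid in place; return True if a solution was found."""
    cell = find_empty(grid)
    if cell is None:
        return True
    r, c = cell
    for digit in range(1, 10):
        if is_valid(grid, r, c, digit):
            grid[r][c] = digit
            if solve(grid):
                return True
            grid[r][c] = 0  # undo the placement and try the next digit
    return False
```

Calling `solve(grid)` on a 9x9 list of lists mutates it into a completed grid when one exists. Smarter cell ordering or an exact-cover formulation would cut the search further, but even this version handles typical puzzles quickly.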
keefle 1 year ago
But even without going this far (integrating various other specialized models or having an LLM use them when required), an LLM is probably able to recognize a sudoku puzzle when it sees one, and even though it can't solve it itself, I think it can easily write the code that would solve sudoku. So instead of hooking it to a set of pre-built models, it might be enough to hook it to a Python interpreter.
- yosito 1 year ago
  
  Many LLMs are already linked to Python interpreters, but they still need some improvement with recognizing when they need to write some code to solve a problem.
- rafaelmn 1 year ago
  
  It can spit out some rehash of sudoku it had in its training data. LLMs are terrible at coding.
tavern1991 1 year ago
What do you mean by "choose the right one to solve a problem"? This phrase seems to carry a lot of water for your take. My understanding is that an LLM has no capability to choose anything. It is predicting some tokens based on its training data and your prompt.
- yosito 1 year ago
  
  Let's try...
  Prompt: Predict which type of algorithm would be effective to solve sudoku.
  Response: A backtracking algorithm is typically best for solving Sudoku puzzles due to its efficiency in exploring all possible number placements systematically until it finds the correct solution.
  ...seemed to work well enough for me.
  Prompt 2: Which type of neural network is most efficient at solving sudoku?
  Response 2: Convolutional Neural Networks (CNNs) are particularly effective for solving Sudoku puzzles. They can capture the spatial hierarchies in the grid by processing parts of the grid as images, making them efficient for this type of puzzle-solving task.
  ...Seems to me that LLMs have no problem with this task.
  

shawntan 1 year ago

If we're trying to quantify what they can NEVER do, I think we'd have to resort to some theoretical results rather than a list of empirical evidence of what they can't do now. The terminology you'd look for in the literature would be "expressibility".

For a review of this topic, I'd suggest: https://nessie.ilab.sztaki.hu/~kornai/2023/Hopf/Resources/st...

The authors of this review have themselves written several articles on the topic, and there is also empirical evidence connected to these limitations.

hansonw 1 year ago
This is also a good paper on the subject:
What Algorithms can Transformers Learn? A Study in Length Generalization https://arxiv.org/abs/2310.16028
- shawntan 1 year ago
  
  Yes, this is a good empirical study on the types of tasks that have been shown to be impossible for transformers to generalise on.
  With both empirical and theoretical support I find it's pretty clear this is an obvious limitation.
jillesvangurp 1 year ago
We have to be a bit more honest about the things we can actually do ourselves. Most people I know would flunk most of the benchmarks we use to evaluate LLMs. Not just a little bit but more like completely and utterly and embarrassingly so. It's not even close; or fair. People are surprisingly alright at a narrow set of problems. Particularly when it doesn't involve knowledge. Most people also suck at reasoning (unless they had years of training), they suck at factual knowledge, they aren't half bad at visual and spatial reasoning, and fairly gullible otherwise.
Anyway, this list looks more like a "hold my beer" moment for AI researchers than any fundamental objection that would stop AIs from evolving any further. Sure, there are weaknesses, and paths to address those. Anyone claiming that this is the end of the road in terms of progress is going to be in for a disappointing reality check, probably a lot sooner than is comfortable.
And of course by narrowing it to just LLMs, the authors have a bit of an escape hatch because they conveniently exclude any further architectures, alternate strategies, and improvements that might otherwise overcome the identified current weaknesses. But that's an artificial constraint that has no real-world value, because of course AI researchers are already looking beyond the current state of the art. Why wouldn't they?
- martindbp 1 year ago
  
  It's clear that what's missing is flexibility and agency. For anything that can be put into text or a short conversation, if I had to choose between access to ChatGPT or a random human, I know what I'd choose.
  
- shawntan 1 year ago
  
  Not sure what you got out of the paper, but for me it was more spurring ideas about how to fix this in future architectures.
  I don't think anyone worth their salt would look at this and think: oh well, that's that then.
cs702 1 year ago
Thank you for sharing this here. Rigorous work on the "expressibility" of current LLMs (i.e., which classes of problems can they tackle?) is surely more important, but I suspect it will go over the heads of most HN readers, many of whom have minimal to zero formal training in topics relating to computational complexity.
- shawntan 1 year ago
  
  Yes, but unfortunately that doesn't answer the question the title poses.
  
marquisdepolis 1 year ago

This is very interesting, thanks Shawn. I did email William Merrill to get his thoughts but haven't had a response yet.
unparagoned 1 year ago
Neural nets can approximate any function.
A large enough LLM with memory is Turing complete.
So theoretically I don’t think there is anything they can never do.
- shawntan 1 year ago
  
  > Neural nets can approximate any function.
  Common misunderstanding of the universal approximation theorem.
  Consider this: can an MLP approximate a sine wave?
  > A large enough LLM with memory is Turing complete.
  With (a lot of) chain of thought it could be.
  Read the paper, and its references.
  
- goatlover 1 year ago
  
  There's all sorts of things a neural net isn't doing without a body. Giving birth or free soloing El Capitan come to mind. It could approximate the functions for both in token-land, but who cares?

naasking 1 year ago

> They have been trained on more information than a human being can hope to even see in a lifetime. Assuming a human can read 300 words a min and 8 hours of reading time a day, they would read over 30,000 to 50,000 books in their lifetime. Most people would manage perhaps a meagre subset of that, at best 1% of it. That’s at best 1 GB of data.

This just isn't true. Human training is multimodal to a degree far beyond even the most capable multimodal model, so human babies arguably see more data by a young age than all models collectively have seen.

Not to mention that human babies don't even start as a blank slate as LLMs do; billions of years of evolution have formed the base model described by our DNA.
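
For scale, here is a back-of-the-envelope check of the text-only figures in the quoted passage. The reading speed and hours per day come from the quote; the 80 years of reading, 100,000 words per book, and 6 bytes per word are assumptions added for illustration.

```python
WORDS_PER_MINUTE = 300    # from the quoted passage
HOURS_PER_DAY = 8         # from the quoted passage
YEARS_READING = 80        # assumption
WORDS_PER_BOOK = 100_000  # assumption
BYTES_PER_WORD = 6        # assumption: about five letters plus a space

words_per_day = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY
lifetime_words = words_per_day * 365 * YEARS_READING
lifetime_books = lifetime_words // WORDS_PER_BOOK
lifetime_gb = lifetime_words * BYTES_PER_WORD / 1e9
one_percent_gb = lifetime_gb / 100

print(f"{words_per_day:,} words per day")
print(f"{lifetime_books:,} books in a lifetime")
print(f"{lifetime_gb:.1f} GB of text, {one_percent_gb:.2f} GB at 1%")
```

Under these assumptions the book count lands inside the quoted 30,000 to 50,000 range and the 1% figure comes in under 1 GB, so the arithmetic holds for text. It says nothing about the non-text input that the objection above is about.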

cs702 1 year ago
I agree with you, but your comment strikes me as unfair nitpicking, because the OP is referring to information that has been encoded in words.
- nickpsecurity 1 year ago
  
  We learn the ideas from each mode of input. Then, one mode can elaborate on data learned from another mode. They build on each other.
  From there, remember the text is usually a reflection of things in the real world. Understanding those things in non-textual ways gives both meaning to and deeper understanding of the text. Much of the text itself was even stored in other modes, like markup or PDFs, whose structure tells us things about it.
  That we learn multimodal from birth is therefore an important point to make.
  It might also be a prerequisite for AGI. It could be one of the fundamental laws of information theory or something. Text might not be enough like how digital devices need analog to interface with the real world.
- naasking 1 year ago
  
  I understand that's the context, but I'm not sure that it's unfair nitpicking. It's common to talk about training data and how poor LLMs are compared to humans despite the apparently larger dataset than any human could absorb in a lifetime. The argument is just wrong because it doesn't properly quantify the dataset size, and when you do, you actually conclude the opposite: it's astounding how good LLMs are despite their profound disadvantage.
  
- a_wild_dandan 1 year ago
  
  Other modalities affect word semantics. You cannot ignore them when discussing sample efficiency.
cess11 1 year ago
Some people seem to be unaware that reality is analog, possibly fractal.
- kelipso 1 year ago
  
  The quantum vibrations I feel against my consciousness cannot be modeled electronically!

Too 1 year ago

While the A:B problem technically was solved, look at the solutions: they are several hundred lines of prompts, rephrasing the problem to the point that a human doesn't understand it any more. Even with a thorough review, nobody can guarantee whether the prompts are going to work; most of them didn't, and a 90% pass rate was considered good enough. The idea of AI is to reduce work, not create more; otherwise what's the point?

In the meantime, it took me about 2 minutes and 0 guesswork to write a straightforward and readable solution in 15 lines of Python. This I know for sure will work 100% of the time and not cost $1 per inference.

Reminds me of some early attempts at executable requirements specifications or model-based engineering. Turns out, expressing the problem is half the problem, resulting in requirements often longer and more convoluted than the code that implements them, code being a very efficient language to express solutions and all their edge cases, free from ambiguity.

Don't get me wrong here, LLMs are super useful for a certain class of questions. The boundaries of what they cannot do need to be understood better, to keep the AI-for-everything hype at bay.
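
For a sense of what such a short Python solution can look like, here is a sketch. The rewrite rules are not given in this thread; the table below assumes the rules as stated in the original A::B challenge, and the names are made up for illustration.

```python
# Assumed rewrite rules for adjacent token pairs (not stated in this thread).
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}


def reduce_ab(program):
    """Rewrite neighboring tokens until no rule applies; return the result."""
    tokens = program.split()
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]  # replace the pair in place
                changed = True
                break  # restart the scan from the left
    return " ".join(tokens)


print(reduce_ab("B# A# #B #A B#"))
```

Restarting the scan after every rewrite is deliberately simple rather than efficient; it keeps the loop easy to verify by hand.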

coffeebeqn 1 year ago

I guess the problem is that if you still need to teach it tricks for each novel problem after training, then that model cannot be a general intelligence. It could still be useful though.

Xenoamorphous 1 year ago

There are many things they can’t do. Even a simple rule like “ensure that numbers from one to ten are written as words and numbers greater ten as digits in the given text” fails for me on so many examples, even if it works for many others; few-shot, chain of thought, many versions of the prompt, it doesn’t matter. Sometimes LLMs will even change the number to something else, even with temp set to 0. And then there’s the non-determinism (again with temp=0): you run the same prompt several times and that one time it’ll respond with something different.

As amazing as they are, they still have many limitations.

I’ve been working with ChatGPT and Gemini to apply simple rules like the one above and I got so frustrated.
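
For comparison, one direction of the rule above (digits from 1 to 10 become words) can be written deterministically in a few lines. This is only a sketch: the function name is made up, and it does not handle the reverse direction (spelled-out numbers above ten), thousands separators, or decimals.

```python
import re

NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
]


def spell_small_numbers(text):
    """Replace standalone integers from 1 to 10 with their English words."""
    def replace(match):
        value = int(match.group())
        if 1 <= value <= 10:
            return NUMBER_WORDS[value]
        return match.group()  # leave larger numbers (and 0) as digits

    # \b\d+\b matches each whole run of digits, so "15" is never split.
    return re.sub(r"\b\d+\b", replace, text)


print(spell_small_numbers("There are 2 blue, 1 red, and 15 green m&ms in this bag."))
```

Note that a number like "5,280" would be seen as two separate runs of digits, so real text would need a stricter pattern than this one.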

empath-nirvana 1 year ago
The reason it can't do that is that, for example, "twenty" and "20" are nearly identical in the vector embedding space and it can't really distinguish them that well in most contexts. That's true for generally any task that relies on sort of "how the words look" vs "what the words mean". Any kind of meta request is going to be very difficult for an LLM, but a multi-modal GPT model should be able to handle it.
- Xenoamorphous 1 year ago
  
  Thanks, I’ll try the multimodal one.
  
vsnf 1 year ago
> ensure that numbers from one to ten as written as words and numbers greater ten as digits in the given text
I can’t fault llms for not knowing what to do here because I, a human, have no idea what on earth this means.
- Snacklive 1 year ago
  
  Given the text "1,2,3,4,5,6,7,8,9,10,11,12" it should result in "one, two, three, four, five, six, seven, eight, nine, ten, 11, 12"
  or at least that's my understanding of the prompt
- zamadatix 1 year ago
  
  I think you may be thrown off because the first "as" is meant to be "are".
  
- Xenoamorphous 1 year ago
  
  “Ten” is a word, “10” are digits.
  I’m not a native English speaker, how would you write it?
  FWIW the LLMs get it right many times, but fail other times.
  
- FroshKiller 1 year ago
  
  It's a simple prescriptive rule in English. If you are writing about a small number, like less than ten, spell it out. For example: "According to a survey, nine out of ten people agree."
  But if you are writing about a large number, particularly one with a lot of different digits, prefer writing the digits: "A mile is 5,280 feet." Compare that to: "A mile is five thousand, two hundred, and eighty feet."
- logtempo 1 year ago
  
  I think he means that numbers less than or equal to ten are written as words, and others are written as digits.
  Given the many responses, it would be fun to see if LLMs beat humans at understanding the sentence, haha.
  
- callmeal 1 year ago
  
  > > ensure that numbers from one to ten as written as words and numbers greater ten as digits in the given text
  There are two blue, one red, and 15 green m&ms in this bag.
Der_Einzige 1 year ago
All of these issues are entirely due to the tokenization scheme. Literally all of them.
You could get this behavior implemented perfectly with constrained text generation techniques like grammars, or with any of the various libraries implementing constrained text generation (e.g. guidance).
- Xenoamorphous 1 year ago
  
  I had briefly looked into Guidance and others (LMQL, Outlines) but I couldn't figure out how to use them for this problem.
  I could think of how to use them to prevent the LLM from generating digits for numbers greater than ten by using a regex plus a constraint that forbids digits, but the main problem is the other part of the rule, i.e. numbers above 10 should never be spelled out and should be written as digits instead. For that I presume you need to identify the spelled out numbers first, for which you presumably would need the LLM so you're back to LLM fallibility.
  Any pointers would be greatly appreciated.
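
For the "identify the spelled-out numbers first" part, one option is to skip the LLM for that step: number words are a small closed vocabulary, so a regex can find them deterministically. Below is a minimal sketch, not a complete solution. It assumes plain integers only, handles spelled-out numbers from eleven to ninety-nine, and deliberately leaves anything that looks like a decimal or a thousands-separated number (3.5, 5,280) alone. All names (`normalize_numbers`, `WORDS_RE`, and so on) are made up for illustration.

```python
import re

# Words for one..ten, the range the rule says must be spelled out.
SMALL = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
TEENS = {"eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
         "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
UNITS = {word: i + 1 for i, word in enumerate(SMALL[:9])}  # one..nine

# Standalone integers only: skip pieces of "5,280" or "3.5".
DIGITS_RE = re.compile(r"(?<![\d.,])\b\d+\b(?![.,]?\d)")

# Spelled-out 11..99: "fifteen", "twenty", "twenty-one", "twenty one".
# The trailing (?!-) avoids rewriting the start of words like "twenty-first".
WORDS_RE = re.compile(
    r"\b(?:(%s)|(%s)(?:[- ](%s))?)\b(?!-)"
    % ("|".join(TEENS), "|".join(TENS), "|".join(UNITS)),
    re.IGNORECASE,
)


def _digits_to_words(match):
    """Spell out 1..10; leave every other integer untouched."""
    n = int(match.group())
    return SMALL[n - 1] if 1 <= n <= 10 else match.group()


def _words_to_digits(match):
    """Convert a spelled-out number in 11..99 to digits."""
    teen, tens, unit = match.groups()
    if teen:
        return str(TEENS[teen.lower()])
    return str(TENS[tens.lower()] + (UNITS[unit.lower()] if unit else 0))


def normalize_numbers(text):
    """Apply the rule: one..ten as words, larger numbers as digits."""
    text = WORDS_RE.sub(_words_to_digits, text)
    return DIGITS_RE.sub(_digits_to_words, text)


print(normalize_numbers("There are 2 blue, 1 red, and fifteen green m&ms in this bag."))
# There are two blue, one red, and 15 green m&ms in this bag.
```

This trades coverage for predictability: it will miss "one hundred and five" and it cannot tell a count from a label ("room 7"), but it never changes a number to a different value and it gives the same output on every run. An LLM could still be used for the ambiguous leftovers only.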
Kiro 1 year ago

You constructed a task that no-one understands and then you even admit that it, despite that, actually succeeds most of the times. Sounds like a massive win for the LLMs to me.

wave_1 1 year ago

I build an Agentic AI that leverages #6 and #7 at the end of the article as well as techniques not yet published. It tackles hallucination relative not to the world at large but to the facts, entities and causal relationships contained in a document (which is really bad reasoning if we assume LLMs are "reasoning" to begin with). It also tackles cross-reasoning with very large token distance.

https://www.youtube.com/watch?v=99NPzteAz94

This is my first post on HN in 10 years.

dazzaji 1 year ago

This looks really promising for complex legal reasoning tasks and other challenges. How can I track progress? Is there an email list or something? Thanks!
asicsarecool 1 year ago
Wow. Please do a show HN.
Is source available?
I would love to play with this
- wave_1 1 year ago
  
  Thanks. This is just in the labs stage, but moving closer to releasing it, exactly so that you can play with it! I have one angel investor involved in supporting this and it's intended for commercial applications in the para legal space, initially (controlled, structured environment) But you just gave me the motivation to "put it out there" so people can just play with it. It'll take a bit of time, but I will do a Show HN then when it's ready for people to play with. Otherwise, it would be just teasing people to talk about it on the main HN stage without giving access. Hold tight! And thanks again!
  
marquisdepolis 1 year ago

Author here. This is super interesting, and while I am mostly a lurker here welcome back?
esafak 1 year ago

Is there a write-up, a web site, and some benchmarks?
brcmthrowaway 1 year ago

What are agents?
Are they layer 2 solutions like Lightning is to bitcoin?

donatj 1 year ago

I have been trying to generate some text recently using the ChatGPT API. No matter how I word “Include any interesting facts or anecdotes without commenting on the fact being interesting” it ALWAYS starts out “One interesting fact about” or similar phrasing.

I have honestly spent multiple hours trying to word the prompt so it will stop including introductory phrases and just include the fact straight. I have gone so far as forbid the words “interesting”, “notable” and a few others in the prompt, and it just ignores me and uses them anyway.

I’ve tried all the different available models and 4 will once in a great while get it right. 3, 3 turbo, and even 4 turbo basically never work as I want.

Overall, it seems pretty bad at being told not to do certain things. For instance there are a number of topics I don’t want mentioned in its response, but telling it not to seems to only increase the likelihood that it will mention it, like you put the idea in its head.

fallingsquirrel 1 year ago
I just tried this locally with llama3-8b and it handled it fine. Claude 3 Sonnet passes your test too, in case you don't have the hardware for a local model. You might want to consider moving on from ChatGPT since their models have been RLHF'd to hell in the name of "safety" and are falling behind in many ways as a result.
---
transcript:
$ ollama run llama3:8b
>>> tell me an interesting fact about etymology
Here's one:
Did you know that the word "quarantine" has a fascinating etymological history?
The word "quarantine" comes from^C
>>> tell me an interesting fact about etymology. just the fact, don't mention it's interesting.
The word "robot" originated from the Czechoslovakian word "robota," which means "forced labor" or "drudgery." This term was used in Karel Čapek's 1920 play "R.U.R." (Rossum's Universal Robots), where it referred to artificial workers.
- _wire_ 1 year ago
  
  Why is any of this "interesting"?
  Is the response interesting because you don't know it? —How does it know this?
  Is the response interesting because you do know it or might have offered it yourself? —Confirmation bias.
  Is it interesting because a lot of training references contextualize it as "interesting?" —Begged question.
  Is it contextually interesting? —What is the context? A robot refers to robots? How unexpected...
  Is it interesting within the narrow confines of LLM adaptations to a scope of inputs?
  Can there be any more damning claim of the general suitability of the technology as an oracle than different users using the same prompts and getting inexplicably contrary results?
  If trivial prompt alignments result in appropriate vs inappropriate responses, this destroys confidence for every response.
  What am I missing?
  
- foobiekr 1 year ago
  
  The RUR thing is basically because that specific example is used as an example of interesting etymology.
ianbicking 1 year ago

I often encounter fixation, and that would be my immediate thought: negative commands can often cause the LLM to fixate on a term or idea. My first thought would be to try positive examples and avoid a negative command entirely.
If you spent that much time I'm sure you tried this and other things, so maybe even that isn't enough. (Though I assume if you ask for a JSON/function call response with the API that you'd do fine...?)
aurareturn 1 year ago
Not an expert but I sense that it's following a higher OpenAI "built in" prompt that asks it to always include an introductory phrase.
Hence, we do need powerful and less censored LLMs if we want to better integrate LLMs into applications.
- ziml77 1 year ago
  
  No it just seems that it becomes blind, so to speak, to the negatives and the inclusion of the words you were negating makes it more likely to apply them in the positive. This is how ChatGPT has seemed to behave whenever I've tried to get it to not include something.
Der_Einzige 1 year ago

API-driven LLMs purposely don't implement core features that would enable what you want, for example, negative prompting.
You can negative prompt any LLM with stuff like "always write the word interesting in your response".
You can also use techniques for modifying logprobs of tokens, which is available in the GPT-4 API (but is hard to use). You can literally ban "interesting" from its vocabulary.
You could even use representation steering techniques to do this using control vectors. See this library as an example: https://github.com/Hellisotherpeople/llm_steer-oobabooga
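
To make the logprob-modification idea concrete: banning a token amounts to adding a large negative bias to its logit before the softmax, so its sampling probability collapses to effectively zero. The sketch below is a toy illustration in plain Python, not a call to any real API. The four-word vocabulary and its logit values are invented, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_logit_bias(logits, bias):
    """Add a per-token bias to the logits; unlisted tokens are unchanged."""
    return {tok: value + bias.get(tok, 0.0) for tok, value in logits.items()}


# Invented next-token logits for the start of a reply.
logits = {"One": 2.0, "interesting": 3.0, "The": 1.5, "Notably": 0.5}

# A bias of -100 effectively removes the token from the vocabulary.
bias = {"interesting": -100.0}

before = softmax(logits)
after = softmax(apply_logit_bias(logits, bias))

print(max(before, key=before.get))  # most likely token without the bias
print(max(after, key=after.get))    # most likely token with the bias
```

The "hard to use" part is that real APIs take token IDs rather than words. A word like "interesting" usually maps to several tokens (with and without a leading space, capitalized, split into pieces), and each variant has to be looked up with the model's tokenizer and biased separately.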
BoorishBears 1 year ago
Have you tried a simple "No pretext or posttext, return the result in a code block"?
- donatj 1 year ago
  
  It's part of a larger prompt trying to get it to generate a couple paragraphs that include interesting facts. I want the facts in the context of the paragraphs.
  
owenpalmer 1 year ago

Have you tried feeding the output into another prompt that says something like "remove any mentions of the facts being interesting"?

JeremyHerrman 1 year ago

Why can't we resolve this with synthetic data? Just take the original statements and ask another LLM to flip around the relation, then train on both. I tried this with GPT-4 and it seems to understand the task [0]:

Me:

For each of the following relational statements, flip the relation around and generate a new statement or question answer pair.

Input: Bluey's father is Bandit

Output: Bandit's daughter is Bluey

Input: Neil Armstrong was the first person to step on the Moon.

Output: Who was the first person to step on the Moon? Neil Armstrong.

Now for the real statements:

- Valentina Tereshkova was the first woman to travel to space

- Who is Mary Lee Pfeiffer's son? Tom Cruise

ChatGPT:

Here are the flipped relational statements:

Valentina Tereshkova was the first woman to travel to space

Output: Who was the first woman to travel to space? Valentina Tereshkova.

Who is Mary Lee Pfeiffer's son? Tom Cruise

Output: Tom Cruise's mother is Mary Lee Pfeiffer.

[0]: https://chat.openai.com/share/33eb3ee4-2094-4748-b01e-0967af...
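
For relations that are already in a structured form, the same augmentation can be sketched without a second LLM. This is only an illustration of the idea, assuming facts are available as (subject, relation, object) triples; the `Fact` class and the `INVERSE` table are hypothetical, and the inverse relations are kept gender-neutral because the triple alone does not say whether a child is a son or a daughter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str  # e.g. "mother"
    obj: str


# Hypothetical table of inverse relations, kept gender-neutral.
INVERSE = {
    "mother": "child",
    "father": "child",
    "son": "parent",
    "daughter": "parent",
}


def forward(fact):
    """Render the fact in its original direction."""
    return f"{fact.subject}'s {fact.relation} is {fact.obj}."


def reverse(fact):
    """Render the same fact with subject and object swapped."""
    return f"{fact.obj}'s {INVERSE[fact.relation]} is {fact.subject}."


def augment(facts):
    """Return training sentences covering both directions of every fact."""
    sentences = []
    for fact in facts:
        sentences.append(forward(fact))
        sentences.append(reverse(fact))
    return sentences


for line in augment([Fact("Tom Cruise", "mother", "Mary Lee Pfeiffer"),
                     Fact("Bluey", "father", "Bandit")]):
    print(line)
```

This only covers relations stated in a simple, separated-out form. Facts buried in longer passages would still need an extraction step first, which is where an LLM (and its error rate) would come back in.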

layer8 1 year ago

The difficulty would likely be that the “X is Y” relations don’t just occur in that simple, separated-out form, but in a whole spectrum of more complicated contexts, and the latter probably dominate. You would have to identify those and synthesize a text corpus that contains the reverse notions across a similar spectrum.
mike_hearn 1 year ago

I don't think the reversal curse actually replicates though. When I looked into that it seemed like there were some flaws in the study and when I fixed those I couldn't get the same results as they did anymore. I keep meaning to write this up properly but never get around to it.
If you think about it, they're making an extraordinary claim given how fluently LLMs engage in language and reasoning.

amayne 1 year ago

Some of these can be done with the right prompting. Wordle was solved a year ago. The problem is that what works and what doesn’t can be counterintuitive or at least require you to think about how something is perceived at the token level.

https://andrewmayne.com/2023/03/29/how-to-play-wordle-with-g...

philipwhiuk 1 year ago

What's the intelligent agent at that point though - the solver or the prompter?
The LLM isn't solving the actual problem, it's solving a subset problem.

oldandtired 1 year ago

Far too many people (including AI researchers themselves) fail to see that all LLMs are actually simple machines. Extremely simple machines that are only mechanically following a relatively simple programming path.

Now before anyone gets too caught up with objecting to this notion, I would seriously suggest that you spend time with observing children from new-born to 2 years.

I have been observing my latest granddaughter since her birth about 16 months ago and thinking about every public LLM system currently available.

There is an insight here to be obtained and that insight is in the nature of real intelligence.

On the whole, no-one actually knows what intelligence is or what sentience is or what it means to be cognitively conscious. There is still much research going on and nothing actually definitive has come forth yet. We really are at the beginning in terms of studying these areas.

We can certainly produce some incredible systems, but none of them are intelligent per se. Solutions to certain kinds of problems can be achieved using these systems and there are researchers who are seriously looking at incorporating these systems into CAS and theorem provers. These systems, though, only provide an augmentation service for a person, as does every mechanical system we use.

But there is an essential component necessary for the use of all LLMs which many seem to not be cognisant of and that is these systems, to be useful, require humans to be involved.

The questions we have to ask ourselves are: what can we use these systems for, and do these uses provide benefits in some way, or can these systems be abused by various parties in obtaining control over others?

There are benefits and there are abuses. Can we do better or will we do worse by using them?

whiplash451 1 year ago

Interesting, if I feed Mistral Le Chat with "I fly a plane leaving my campsite, heading straight east for precisely 24,901 miles, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger?", it gets it badly wrong:

The scenario you described is possible if you started at the South Pole. If you travel 24,901 miles east from there, you would indeed end up back at the same spot because all lines of longitude converge at the poles. However, there are no tigers in Antarctica.

Tigers are native to Asia, not Antarctica. The closest tiger species to Antarctica would be the Siberian tiger, found in parts of Russia, China, and North Korea, but they are still thousands of miles away from Antarctica.

So, while the travel scenario is theoretically possible, the presence of the tiger is not realistic in this context. It seems like an imaginative or hypothetical situation rather than a real-world one.

(instead of the answer mentioned in the article)

Karellen 1 year ago
> The scenario you described is possible if you started at the South Pole. If you travel 24,901 miles east from there, you would indeed end up back at the same spot
Would you? How can you travel east at all, starting from the South Pole? There is no east. Every direction you face is north.
OTOH, every line of latitude whose length is an integral divisor of 24,901 miles will satisfy the problem.
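
The latitude claim can be checked with a little trigonometry. On a spherical Earth, the circle of latitude at angle φ has length C·cos(φ), where C is the equatorial circumference. If C is taken to be the riddle's 24,901 miles, flying east along that circle brings you back to camp after exactly n laps when cos(φ) = 1/n. The sketch below assumes a perfect sphere and that circumference; `latitude_for_laps` is a made-up helper name.

```python
import math

# Approximate equatorial circumference in miles, the figure used in the riddle.
EQUATOR_MILES = 24_901


def latitude_for_laps(laps):
    """Latitude in degrees whose circle has length EQUATOR_MILES / laps."""
    return math.degrees(math.acos(1.0 / laps))


for laps in range(1, 5):
    circle_miles = EQUATOR_MILES / laps
    print(f"{laps} lap(s): latitude ±{latitude_for_laps(laps):.2f}°, "
          f"circle length {circle_miles:.1f} miles")
```

One lap is the equator itself, two laps is 60° north or south, three laps is about 70.53°, and four is about 75.52°. Only the low-lap solutions are anywhere near tiger habitat, which is presumably why the riddle's intended answer starts from the equator.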
- whiplash451 1 year ago
  
  Looks like Mistral overfitted on Google’s interview riddles!
GuB-42 1 year ago

There is a similar, better known riddle involving the north pole and polar bears.
It probably hallucinated the answer from that.

3abiton 1 year ago

The article should be titled " What can LLM never do, yet". By definition, Large Language Models would keep growing larger and larger, to be trained on faster and more advanced hardware, and certain points like "completing complex chains of logical reasoning" tasks, would be just a time hurdle. Only time will tell.

HarHarVeryFunny 1 year ago
We really don't need time to tell.
Just making a transformer bigger and bigger, and feeding it more and more data, will not change it from being a language model into something else, anymore than scaling up an expert system such as Cyc will transform it into something other than an expert system. "Scale it up and it'll become sentient" is one of the recurring myths of AI.. a bit odd that people are falling for it again.
As an aside, it seems reasonable to consider an LLM as a type of expert system - one that has a broad area of expertise (like Cyc), including (unlike Cyc) how to infer rules from language and generate language from rules.
If you want to create a brain-like AGI, then you need an entire cognitive architecture, not just one piece of it which is what we have currently with LLMs. Compared to a brain, an LLM is maybe just like the cortex (without all the other brain parts like cerebellum, hippocampus, hypothalamus and interconnectivity such as the cortico-thalamic loop). It's as if we've cut the cortex out of a dead person's brain, put it in a mason jar to keep it alive, and hooked its inputs and outputs up to a computer. Feed words in, get words out. Cool, but it's not a whole brain, it's a cortex in a mason jar.
- logicprog 1 year ago
  
  Well said. This has always been my fundamental problem with the claims about large language models' current or eventual capabilities: most of the things people claim it can or will be able to do require a neural architecture completely different from the one it has, and no amount of scaling up the number of neurons and the amount of training data used will change that fundamental architecture, and at a very basic level the capabilities of any neural network are going to be limited by its architecture. We would need to add some kind of advanced recursive structure to large language models, as well as some kind of short-term and working memory, as well as probably many other structures, to make them capable of the kind of metacognition necessary to properly do a lot of the things people want them to be able to do. Without metacognition, the ability to analyze what one is currently thinking and think new things based on that analysis, and therefore to look at what one is thinking and error correct it, consciously adjust it or iterate on it, or consciously ensure that one is adhering to certain principles of reasoning or knowledge, we can't expect large language models to be able to actually understand concepts and principles and how they are applicable and reliably perform reasoning or even obey instructions.
- hackinthebochs 1 year ago
  
  >will not change it from being a language model into something else,
  This is a pretty empty claim when we don't know what the limits of language modelling are. Of course it will never not be a language model. But the question is what are the limits of capability of this class of computing device?
  
- jaza 1 year ago
  
  Thanks for that final paragraph! I'm going to quote you from now on, when trying to explain to someone (for the thousandth time) why ChatGPT isn't about to become super-intelligent and take over the world.
Mordisquitos 1 year ago
I think that the article is correct. There are indeed things that LLMs will never be able to do, at least not consistently, however much the hardware improves or however much more material they are trained on.
How come? Note my emphasis on the 2nd 'L'. I'm not saying that there are things that AI models will never be able to do, I'm saying that there are things that Large Language Models will be unable to do.
Training LLMs is often argued to be analogous to human learning, most often as a defence against claims of copyright infringement by arguing that human creativity is also based on training from copyrighted materials. However, that is a red herring.
The responses from ever more powerful LLMs are indeed impressive, and beyond what an overwhelming majority of us believed possible just 5 years ago. They are nearing and sometimes surpassing the performance of educated humans in certain areas, so how come I can argue they are limited? Consider it from the other side: how come an educated human can create something as good as an LLM can when said human's brain has been "trained" on an infinitesimal fraction of the material which was used to train even the 1st release of ChatGPT?
That is because LLMs do not learn nor reason like humans: they do not have opinions, do not have intentions, do not have doubts, do not have curiosity, do not have values, do not have a model of mind — they have tokens and probabilities.
For an AI model to be able to do certain things that humans can do it needs to have many of those human characteristics that allow us to do impressive mental feats having absorbed barely any training material (compared to LLMs) and being virtually unable to even remember most of it, let alone verbatim. Such an AI model is surely possible, but it needs a completely different paradigm from straightforward LLMs. That's not to say that a Language Model won't be a necessary module of such an AI; it almost certainly will be, but it will not be sufficient.
- mewpmewp2 1 year ago
  
  I don't think values, opinions or things like that are needed at all. These are just aspects we have in order to perform in and together with the society.
  Also doubt is just uncertainty, and can be represented as a probability. Actually all values and everything can be presented as a numerical probability, which I personally prefer to do as well.
  
amelius 1 year ago

The big question is whether LLMs are capable enough to converge to AGI. It might very well be that as we pour in more resources they converge to something only slightly more useful than, but similar to, what we have today.
g5g5g5g5 1 year ago
> The article should be titled " What can LLM never do, yet".
I don't think it should. It's more interesting to know what LLMs will _never_ be able to do (if anything).
- shawntan 1 year ago
  
  Yes, but the article doesn't really answer this question.
Quothling 1 year ago

In the Danish public sector we provide services based on need assessments of citizens. Then we subsequently pay the bills for those services. This amounts to thousands of small invoices having to be paid by a municipality each month. Examples include payments for a dentist visit, transportation and similar. Most of these are relatively small in size, and we've long since automated the payment of anything below a certain amount. These systems are faster and less error prone as far as putting valid data everywhere goes. They are more prone to decision making errors, however, and while fraud isn't an issue, sometimes citizens have invoices approved that they aren't entitled to. Since it's less costly to just roll with those mistakes than to try and fix them, it's an accepted loss.
The systems are hugely successful and popular, and this naturally leads to a massive interest in LLMs as the next step. They are incredible tools, but they are based on probability, and while they may well be useful for almost everything, decision making probably shouldn't be one of those things. Similarly, ML is incredibly helpful in things like cancer detection, but we've already had issues where it got things wrong, and because MBAs don't really know how these systems work, they were used as a replacement instead of an enhancement for the human factor. I'm fairly certain we're going to use LLMs for a lot of things where we shouldn't, and probably never should. I'm not sure we can avoid it, but I wouldn't personally trust them to do any sort of function which will have a big influence on people's lives. I use both Copilot and OpenAI's tools extensively, but I can still prompt them with the same thing and get extremely different quality outputs, and while this will improve, and while it's very easy to get an output that's actually useful, it's still a major issue that might never get solved well enough for what we're going to ask of the models way before they are ready.
I hope we're going to be clever enough to only use them as enhancement tools in the vital public sector, but I'm sure we're going to use them in areas like education. Which is going to be interesting... We already see this with new software developers in my area of the world, where they build things with the use of LLMs, things that work, but aren't built "right" and will eventually cause issues. For the most part this doesn't matter, but you really don't want the person designing your medical software to use an LLM.
ekianjo 1 year ago

Math reasoning is still an unsolved problem even if the rest of the capabilities are getting better. This means the transformer architecture may not be the best way to approach all problems.
gmuslera 1 year ago

Maybe the wording is correct. It looks like a hard limit on what an LLM can do. If it goes beyond that, then it is something more, or at least different, than an LLM.

CuriouslyC 1 year ago

Some of these "never do" things are just artifacts of textual representation, and if you transformed Wordle/sudoku into a different domain, the exact same transformer architecture would have a much higher success rate.

We don't need to create custom AGI for every domain, we just need a model/tool catalog and an agent that is able to reason well enough to decompose problems into parts that can be farmed out to specialized tools then reassembled to form an answer.

sdenton4 1 year ago

"The models, in other words, do not well generalise to understand the relationships between people."

Curiously, the need to do this well - handling the quadratic complexity of a shifting set of human relationships, grudges, and alliances - is thought to be one of the things that led us to higher levels of intelligence.

ryanblakeley 1 year ago
Just to be clear, these models can answer questions about relationships between people if you mean family relationships.
Answering questions about what you're describing sounds really interesting. What would a training set be like that describes a bunch of complex human relationships and then asks questions about them with objective answers?
- sdenton4 1 year ago
  
  Of course, it would be easy to put such questions together, and I'm sure the LLM would do fine with them - there's a massive amount of human text about human relationships.
  One difference, as in all ML training, is interactivity. Looking at ape studies, knowing the relationships is partly diagnostic, but it's also about planning and competition. And that competitive/adaptive aspect is what looks like a real evolutionary driver. If you can understand, navigate, and manipulate relationships successfully, you get more mating opportunities. Doing /that/ well involves both reasoning and long term planning, both of which are apparent in chimps.
  A good book on this topic is 'Are We Smart Enough to Know How Smart Animals Are?' by Frans de Waal.
  https://wwnorton.com/books/Are-We-Smart-Enough-to-Know-How-S...

usgroup 1 year ago

It’s an auto regressive model so it can’t do anything that requires planning tokens.

It can’t do anything which implies a large or infinite token space (eg video understanding).

It’s also limited to a reasonable response length since token selection is probabilistic at each recursion. The longer you make it the more likely it is to veer off course.
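
The "veer off course" point can be put in numbers with a toy model. Assume, purely for illustration, that each generated token independently has a small probability e of derailing the response. Then the chance that an n-token response stays on course is (1 - e)^n, which decays exponentially with length. Real models do not have independent, constant per-token error rates, so treat this as a sketch of the argument rather than a measurement; both function names are hypothetical.

```python
import math


def on_course_probability(per_token_error, length):
    """Probability that no token derails a response of the given length."""
    return (1.0 - per_token_error) ** length


def max_length_for(per_token_error, target=0.5):
    """Longest response whose on-course probability is still >= target."""
    return math.floor(math.log(target) / math.log(1.0 - per_token_error))


for length in (10, 100, 1000):
    p = on_course_probability(0.01, length)
    print(f"{length:>5} tokens at 1% per-token error: {p:.4f}")

print(max_length_for(0.01))  # 68
```

Under these assumptions, a 1% per-token error rate leaves only about a 37% chance of a clean 100-token response, and the odds drop below even after 68 tokens. Whether real errors compound like this, or whether a model can recover from an early misstep, is exactly what the independence assumption glosses over.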