Comment by danpalmer

13 hours ago

I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
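As a rough illustration of the SQL example, here is a minimal sketch in which the model only supplies the query text and the database does the actual computation. The table, the data, and the `llm_written_sql` string are all hypothetical stand-ins, not output from any real model.

```python
import sqlite3

# Hypothetical query text, standing in for what an LLM might return when
# asked for "total revenue per region, highest first".
llm_written_sql = """
SELECT region, SUM(amount) AS revenue
FROM orders
GROUP BY region
ORDER BY revenue DESC
"""


def run_query(sql: str) -> list[tuple]:
    """Run the query on a small in-memory table; the database does the arithmetic."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0), ("north", 30.0), ("east", 95.5)],
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


result = run_query(llm_written_sql)
print(result)
```

The query can be read and reviewed by a person before it runs, and running it twice gives the same answer, which is not something you get when the model is asked to "do the analysis" directly.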

What I'd really like to see is a better-defined taxonomy of work, and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.


p-e-w 9 hours ago

> due to fundamental limitations

People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.

aDyslecticCrow 2 hours ago
> character counting
The models now waste a vast amount of useless neurons memorising the character count of the entire English language so that people can ask how many r's are in strawberry and tick a box in a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how a model has encoded that ability in its neurons, you would never, ever trust any arithmetic it outputs, even if it seems to "know" it (unless it called a calculator MCP to achieve it).
There are fundamental limitations, but we're currently brute forcing ourselves through problems we could trivially solve with a different tool.
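A minimal sketch of the "different tool" idea for arithmetic: a small deterministic evaluator that a model could hand expressions to instead of producing digits itself. The `calculate` function is hypothetical and is not any particular MCP server's interface; it only covers the four basic operators and unary minus.

```python
import ast
import operator

# Operators the sketch accepts; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Accept numbers only; bool is excluded even though it subclasses int.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("unsupported constant")
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


print(calculate("123456789 * 987654321"))
print(calculate("-(2 + 3) * 4"))
```

The point of the design is that the result comes from integer arithmetic in the interpreter, so it is exact regardless of how large the operands are.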
- p-e-w 1 hour ago
  
  > The models now waste a vast amount of useless neurons memorising the character count of the entire English language
  No they don’t. They only need to know the character count for each token, and with typical vocabularies having around 250k entries, that’s an insignificant number for all but the tiniest LLMs.
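The per-token argument can be sketched as a lookup table with one entry per vocabulary item, summed over the tokens of a word. The vocabulary below is a toy, not taken from any real tokenizer, and the sketch says nothing about how a trained model actually represents this internally.

```python
# Toy vocabulary: hypothetical tokens, not taken from any real tokenizer.
TOY_VOCAB = ["straw", "berry", "st", "raw", "b", "erry", "r"]


def build_count_table(vocab, letter):
    """Map each token to the number of times `letter` occurs in it."""
    return {token: token.count(letter) for token in vocab}


def count_letter(token_sequence, table):
    """Sum the per-token counts over one tokenization of a word."""
    return sum(table[token] for token in token_sequence)


r_table = build_count_table(TOY_VOCAB, "r")

# Two different splits of "strawberry" give the same total.
print(count_letter(["straw", "berry"], r_table))          # 3
print(count_letter(["st", "raw", "b", "erry"], r_table))  # 3
```

The table has one entry per token and per letter of interest, so its size scales with the vocabulary rather than with the number of words in the language.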
  
coldtea 7 hours ago
>People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist
Some limitations have not been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs. Shouldn't the burden of proof be on those who say it can be done?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
https://arxiv.org/abs/2401.11817?utm_source=chatgpt.com
- p-e-w 7 hours ago
  
  That paper’s abstract doesn’t carry its title, to put it mildly.
  
dijit 9 hours ago

Character counting remains a huge issue without tools.
Are you using only frontier models that are gated behind OpenAI/Anthropic/Google APIs? Those use tools to help them out behind the scenes. It remains no less impressive, but I think we should be clear.
girvo 6 hours ago
The literal best public models still fail to count characters consistently in practice, so I’m not sure what you mean. It’s literally a problem we’re still trying to solve at work.
- outofpaper 5 hours ago
  
  What's amazing is that they can even appear to count characters fairly reliably. I mean, we're talking about systems that infer sequences, not character counters or calculators. They are amazing in unrelated ways and we need to accept this so we can use them effectively.
  
3form 6 hours ago

Is character counting actually not an issue anymore? Do you know somewhere where I can read more about this?
3form 5 hours ago

Your comment, after removing the particulars, has a shape of:
People have an <opinion> which hasn't been rigorously proven, while <not rigorously proven counteropinion>.
As such, I am not sure what you're trying to achieve here.
mrob 5 hours ago
Character counting errors are a side effect of tokenization, which is a performance optimization. If we scaled the hardware big enough we could train on raw bytes and avoid it.
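To make the tokenization point concrete, here is a small sketch comparing a token-level view of a word with its raw UTF-8 bytes. The two-token split is a hypothetical example, not the output of a real tokenizer.

```python
word = "strawberry"

# Hypothetical token split: the individual letters are hidden inside two opaque units.
toy_tokens = ["straw", "berry"]

# Byte-level view: every letter of this ASCII word is its own input element.
byte_sequence = list(word.encode("utf-8"))

print(len(toy_tokens), len(byte_sequence))  # 2 10
r_count = byte_sequence.count(ord("r"))
print(r_count)  # 3

# Bytes are still not the same as characters outside ASCII.
accented = "naïve"
print(len(accented), len(accented.encode("utf-8")))  # 5 6
```

The byte sequence is five times longer than the token sequence here, which is the cost side of the performance trade-off the comment describes.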
- teiferer 3 hours ago
  
  No, tokenization is not the only reason. A next-word predictor fundamentally has a hard time executing algorithms, even ones as simple as counting.
  
raincole 5 hours ago

Drawing five-fingered humans was a fundamental limitation... until it wasn't.
danpalmer 8 hours ago
This is kind of my point: we need to get better at describing the limitations and studying them. It seems extremely clear that there are limitations, and not just temporary ones, but structural limitations that existed at the beginning and continue to persist.
- ijidak 5 hours ago
  
  Yeah I think it was the word "fundamental" he took issue with.
Marazan 7 hours ago
If you remove the auxiliary tools and just leave the core LLM then strawberry still has an undefined number of `r`s in it.
- p-e-w 7 hours ago
  
  That’s false. Larger LLMs learn token decompositions through their training, and in fact modern training pipelines are designed to occasionally produce uncommon tokenizations (including splitting words into individual characters) for this reason. Frontier models have no trouble spelling words even without tools. Even many mid-sized models can do that.
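The comment does not name a specific method, so the following is only a toy illustration of showing the same word under different splits, including a full character-level split. It cuts at arbitrary positions rather than sampling from a real tokenizer's vocabulary, and the function names and `split_prob` parameter are hypothetical.

```python
import random


def random_segmentation(word, rng, split_prob=0.3):
    """Cut the word at each interior position with probability split_prob."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < split_prob:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


def spell_out(word):
    """The extreme case: one piece per character."""
    return list(word)


rng = random.Random(0)  # Seeded so that repeated runs print the same splits.
for _ in range(3):
    print(random_segmentation("strawberry", rng))
print(spell_out("strawberry"))
```

Every segmentation joins back to the same string, so a model trained on such variants sees the relationship between a word and its smaller pieces many times over.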
  
rimliu 9 hours ago
Of course, if you choose to ignore all the limitations, they indeed have no limitations.
- mkbosmans 8 hours ago
  
  Nobody says they have no limitations. The question is whether those limitations are fundamental, i.e. can we expect improvement, say within a year?
  

locknitpicker 10 hours ago

> There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions.

Not so long ago, this was what early adopters of LLM coding assistants claimed was the right way to use them in coding tasks: prompt to draft the outline, and then prompt to implement each function. There were even a few posts on HN about blog posts showing off this approach, with terms inspired by animation work.

Sammi 6 hours ago

In short, LLMs are pretty great at working at a single level of abstraction at a time.
You can go from the highest level all the way down to the lowest level with LLMs; you just have to work at it iteratively, one level at a time.
danpalmer 9 hours ago

I'm not necessarily suggesting always getting down to literally the function level, although I think that gives you excellent quality control, but having a code-level understanding is clearly an important factor.