Comment by bandrami

4 days ago

Yesterday I asked mistral to list five mammals that don't have "e" in their name. Number three was "otter" and number five was "camel".

phi4-mini-reasoning took the same prompt and bailed out because (at least according to its trace) it interpreted it as meaning "can't have a, e, i, o, or u in the name".

Local is the only inference paradigm I'm interested in, but these things have a way to go.

I don't really see the problem here. Yeah, we know that these models are not good for actual logic. These models are lossy data compression and most-likely-responses-from-internet-forums-and-articles machines.

These kinds of parlor tricks aren't interesting, and whether a model can list animals with or without some letter in their names doesn't mean much, especially since it isn't like the model "thinks" in English; it just gives you the answer after translating it to English.

These are funny, like the weird stuff you can do in JavaScript by combining special characters, but that doesn't really mean anything in the grand scheme of things. Like JavaScript, these models, despite their specific flaws, continue to deliver value to the people using them.

  • You don't see the problem with a multi-billion-dollar project being unable to give a correct answer to a trivial question? This tech is supposed to revolutionize business, increase productivity to unfathomable levels, and automate all our dull, boring tasks so we can focus on interesting things! Where have you been the past 4 years?

    • This. Part of my role is assessing and recommending what AI implementations, if any, we might add to our production systems, and I ran this experiment because my boss's boss did it himself first and sent me a screenshot with the caption "concerning" (though he got "tiger" as his animal). It's going to be a hard sell for more complicated things as long as it makes catastrophic mistakes like this on simple ones.

    • Billion-dollar businesses had trouble answering trivial questions before AI. The promise of LLMs is that they could actually improve the situation!

  • Is this parlour trick so different from useful tasks like “implement this feature while following the naming conventions of my project”?

    • The difference is that in a software project you can throw more than one instance of the model at the code. If you tell it to follow your naming conventions and it fails to do so, that can be picked up by an instance of the same LLM that's running checks before you commit anything. Even though it's the same model it'll usually detect stuff like that. You can even have it do multiple passes.

      The way most people are coding with AI today is like Baby's First AI™ compared to how we'll all be using LLMs for coding in the future. Soon that "double-check everything" step will be built into the coding agents, and you'll have configuration options for how many passes you want them to perform (a speed vs. accuracy tradeoff).
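      The generate-then-check loop described above can be sketched in a few lines; llm() here is a hypothetical stand-in for whatever call reaches your local model, and the prompts are illustrative, not a fixed protocol:

```python
def generate_with_checks(task, llm, max_passes=3):
    """Draft an answer, then let the same model critique and revise it.

    `llm` is any prompt -> text callable (e.g. a thin wrapper around a
    local model). Each pass asks the model to review its own draft.
    """
    draft = llm(f"Do this task:\n{task}")
    for _ in range(max_passes):
        verdict = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "Does the draft satisfy the task, including naming conventions? "
            "Reply 'OK' or 'FIX: <what is wrong>'."
        )
        if verdict.strip().upper().startswith("OK"):
            break  # the checker pass is satisfied
        # Feed the critique back in and try again
        draft = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            f"Reviewer feedback:\n{verdict}\n\nRewrite the draft to fix it."
        )
    return draft
```

      Raising max_passes is exactly the speed vs. accuracy knob: more checker passes cost more inference time but catch more convention slips.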

    • From the model's perspective it's completely different. LLMs have no concept of what a letter is due to the way they're trained.

Models will always struggle with this specific task without tool use, because of the way they tokenize text. I think a bit of prompt engineering, asking the model to spell out each word, or giving it the ability to run a "contains e" Python function on the animal names it generates or searches for, would solve this.
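As a trivial illustration, the "contains e" check is a one-liner the model could call as a tool instead of guessing at letters; the candidate list here is just made up for the example:

```python
def lacks_letter(name: str, letter: str = "e") -> bool:
    """True if `letter` does not appear in `name` (case-insensitive)."""
    return letter.lower() not in name.lower()

# Hypothetical candidates the model generated or searched for:
candidates = ["bat", "otter", "camel", "lynx", "puma", "fox", "yak"]
mammals_without_e = [n for n in candidates if lacks_letter(n)]
# -> ['bat', 'lynx', 'puma', 'fox', 'yak']
```

The model's job shrinks to proposing plausible mammals; the letter check, which it is structurally bad at, is done in code.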

Lots of local AI use cases, I think, are solvable similarly once local models get good at tool use and have the proper harness.

  • The problem with tool use is that I usually find I only need it for one component of a pipeline. So in this case I'd mentally model the tooling as

    cat /usr/share/dict/words | print_if_mammal | grep -v 'e'

    but I don't know of a good way to incorporate an LLM into a pipeline like that (I know there's a Python API). What I'm actually interested in is "is this the name of a mammal?", but I don't know of an equivalent of a quiet "batch mode", at least for ollama (and of course there's the question of performance).

    I guess ultimately I would want to say "write a shell utility that accepts a line from standard input and prints it to standard output if that is the name of a mammal", and then use that utility in that pipeline. Or really to have an llmfilter utility that lets you do something like

    cat /usr/share/dict/words | llmfilter "is this a mammal?" | grep -v "e"

    and now that I've said that I think I'll try to make one.
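    A minimal llmfilter could look something like the sketch below. The ollama invocation and the model name are assumptions, and the ask parameter exists so the yes/no call can be swapped out (or tested) without a model running:

```python
#!/usr/bin/env python3
"""llmfilter: print stdin lines for which a local LLM answers yes."""
import subprocess
import sys


def ollama_ask(prompt, model="llama3"):
    # Assumes an `ollama run <model> <prompt>` CLI is on PATH;
    # the model name is illustrative.
    out = subprocess.run(["ollama", "run", model, prompt],
                         capture_output=True, text=True)
    return out.stdout


def llmfilter(lines, question, ask=ollama_ask):
    """Yield each non-empty line whose answer starts with 'yes'."""
    for line in lines:
        item = line.strip()
        if not item:
            continue
        answer = ask(f"Answer only yes or no. {question} Item: {item}")
        if answer.strip().lower().startswith("yes"):
            yield item


if __name__ == "__main__" and len(sys.argv) > 1:
    for kept in llmfilter(sys.stdin, sys.argv[1]):
        print(kept)
```

    One model invocation per dictionary word would be painfully slow, so the first real improvement would be batching many lines into a single prompt; but even this naive version slots into the pipeline as written.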

    • This exists with Claude Code / Cursor's agent: just agent -p or claude -p.

      But I think the more powerful thing is "I want a storybook of mammals, one for each letter" -> a local LLM that plans to use search for a list of animals, filters them by starting letter, picks one for each, and maybe calls a diffusion model for pictures or fetches Wikipedia to get context to write a blurb about each.

      The key unlock, imo, is the local LLM recognizing the limits of its own ability and completing tool-use calls, rather than trying to one-shot it with next-word completion and its limited parameter count.

Treat LLMs as dyslexic when it comes to spelling. Assess their strengths and weaknesses accordingly.

  • They're literally text generators so that's... troubling

    • They're text generators, but you can think of them as operating with a different alphabet than ours. When they're given text input, it's not in our alphabet, and when they produce text output, it's not in our alphabet either. So when you ask them what letters are in a given word, they're literally just guessing when they respond.

      Rather, they use tokens that are usually combinations of 2-8 characters. You can play around with how text gets tokenized here: https://platform.openai.com/tokenizer

      _____

      For example, the above text I wrote has 504 characters but only 103 tokens.
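      A toy version of the idea, with a made-up vocabulary and made-up token IDs (real BPE tokenizers learn their merges from data, but the effect is the same):

```python
# Greedy longest-match "tokenizer" over a tiny hand-written vocabulary.
# Both the pieces and the IDs are invented for illustration.
VOCAB = {"ott": 4021, "er": 263, "cam": 1107, "el": 301}


def tokenize(text):
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


tokenize("otter")  # -> [4021, 263]: two opaque IDs, never o-t-t-e-r
```

      From the model's side, "otter" is the ID pair [4021, 263]; nothing in that representation says whether an "e" is inside, which is why letter-level questions come out as guesses.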


    • There are incredible authors who happen to be dyslexic, and brilliant mathematicians who struggle with basic arithmetic. We don't dismiss their core work just because a minor lemma was miscalculated or a word was misspelled. The same logic applies here: if we dismiss the semantic capabilities of these models based entirely on their token-level spelling flaws, we miss out on their actual utility.