Comment by pegasus, 1 year ago

Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kinds of tasks they're fit for, and thus the kinds of tasks they will actually be used to solve, not tasks for which they would be misemployed to begin with.

I agree with you, but that's not what the post claims. From the article:

"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."

Phrases like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions, then, IMO, they don't get to backtrack when simple examples show the opposite.

It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.
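
For what it's worth, the letter-counting failure falls straight out of tokenization: the model never sees individual characters, only token IDs. Here's a minimal sketch, using OpenAI's tiktoken purely because it's the easiest BPE tokenizer to install; Mistral's tokenizer has a different vocabulary but splits text into multi-character chunks in the same way.

```python
# Rough illustration of why letter counting is hard for an LLM:
# the model operates on token IDs, not characters.
# tiktoken is used as a stand-in here; Mistral's tokenizer differs
# in detail but behaves the same way in principle.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)  # a short list of integer IDs -- no notion of letters
print(pieces)     # multi-character chunks, not individual characters

# Counting the letter 'r' is trivial on the string itself...
print(word.count("r"))  # 3
# ...but the model only ever "sees" the opaque token IDs above.
```

None of that is visible to someone just typing into a chat box, which is kind of the point: the failure mode is obvious once you know the mechanism and invisible if you don't.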

That doesn't even take into account what OpenAI has typically done to intercept queries and cover for the shortcomings of LLMs. It would be useful if each model did indeed ship with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.