Comment by tikkun
1 year ago
When using a prompt that involves thinking first, all three get it correct.
"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."
Llama 405b: correct
Mistral Large 2: correct
Claude 3.5 Sonnet: correct
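A minimal sketch of how one might run that prompt, using the Anthropic Python SDK for the Claude 3.5 Sonnet case (the model id below is an assumption; the other two can be queried the same way through their own APIs):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    prompt = (
        "Count how many rs are in the word strawberry. First, list each letter "
        "and indicate whether it's an r and tally as you go, and then give a "
        "count at the end."
    )

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model id; use whichever Sonnet you have
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)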
This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them, does A or B based on condition C, and then returns Y by applying transformation Z."
At that point it was easier to do it myself.
Exact instruction challenge https://www.youtube.com/watch?v=cDA3_5982h8
"What programming computers is really like."
EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.
It’s not impressive that one has to go to that length though.
IMO it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.
There are definitely tons of weaknesses with LLMs, but I continue to be impressed by what they do right, not upset at what they do wrong.
You can always find something to be unimpressed by, I suppose, but the fact that this was fixable with plain English is impressive enough to me.
The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.
*) that is, except sometimes by making adjustments to the system prompt
The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b confidently tell me it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up software features out of thin air when you ask "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.
Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...
> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"
To me it's just a limitation of the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know the spelling of some words. But they've never actually seen one, because their world is made up entirely of tokens. The word 'red' isn't r-e-d to them; it's more like a pictogram. Yet they know the spelling of strawberry, and they can identify an 'r' when it's on its own and count those, despite not being able to see the r's in the word itself.
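To make that concrete, here is a minimal sketch using the tiktoken library's cl100k_base encoding (an assumption; other models use other tokenizers, but the effect is the same):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode("strawberry")
    pieces = [enc.decode([t]) for t in token_ids]
    print(pieces)  # a handful of multi-letter chunks, not s-t-r-a-w-b-e-r-r-y
    # The model operates on those chunks, so the individual r's are never
    # directly visible to it.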
I think it's more that the question is not unlike "is there a double r in strawberry?" or "is the r in strawberry doubled?"
Even some people make this association, so it's no surprise that LLMs do.
The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.
To be fair, I just asked a real person and had to go to even greater lengths:
Me: How many "r"s are in strawberry?
Them: What?
Me: How many times does the letter "r" appear in the word "strawberry"?
Them: Is this some kind of trick question?
Me: No. Just literally, can you count the "r"s?
Them: Uh, one, two, three. Is that right?
Me: Yeah.
Them: Why are you asking me this?
You need to prime the other person with a system prompt that makes them compliant and obedient.
I look forward to the day when LLM refusal takes on a different meaning.
"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."
Try asking a young child...
Compared to chatbots of even five years ago, the answer of two is still mind-blowing.
This can be automated.
GPT-4o already does that: for problems involving math, it will write small Python programs to handle the calculations instead of doing them in the LLM itself.
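The generated helper is typically just a few lines, something along the lines of the sketch below (the exact code varies from run to run):

    word = "strawberry"
    count = sum(1 for ch in word.lower() if ch == "r")
    print(count)  # 3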
Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.
For example, the latter model answered with:
To count the number of Rs in the word "strawberry", I'll break it down step by step:
1. Start with the individual letters: S-T-R-A-W-B-E-R-R-Y
2. Identify the letters that are "R": R (first one), R (second one), and R (third one)
3. Count the total number of Rs: 1 + 1 + 1 = 3
There are 3 Rs in the word "strawberry".
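A minimal sketch of the same approach, assuming the Ollama Python client with llama3.1:70b pulled locally (any chat API works the same way):

    import ollama

    question = 'How many "r"s are in the word "strawberry"?'
    response = ollama.chat(
        model="llama3.1:70b",
        messages=[{"role": "user", "content": question + " Think step-by-step."}],
    )
    print(response["message"]["content"])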
Chain-of-Thought (CoT) prompting to the rescue!
We should always put some effort into prompt engineering before dismissing the potential of generative AI.
Why doesn't the model prompt engineer itself?
Because that is a challenging task: you would need to define a prompt (or a set of prompts) that can reliably generate chain-of-thought prompts for the wide variety of problems the model encounters.
And sometimes CoT may not be the best approach. Depending on the problem, other prompt engineering techniques will perform better.
By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.
Can't you just instruct your LLM of choice to transform your prompts like this for you? Basically, feed it a bunch of heuristics that will help it better understand what you tell it.
Maybe the various chat interfaces already do this behind the scenes?
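As a sketch of that idea: chain two calls, one to rewrite the prompt and one to answer the rewritten version. The OpenAI client, model name, and rewriting heuristics below are stand-ins, not what any particular chat interface is known to do:

    from openai import OpenAI

    client = OpenAI()

    REWRITE_INSTRUCTIONS = (
        "Rewrite the user's prompt so a language model will answer it more "
        "reliably: ask for step-by-step reasoning, spell out any letter-level "
        "or arithmetic work, and request the final answer on its own line. "
        "Return only the rewritten prompt."
    )

    def ask_with_rewrite(raw_prompt: str) -> str:
        # Pass 1: have the model transform the prompt using the heuristics above.
        rewritten = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": REWRITE_INSTRUCTIONS},
                {"role": "user", "content": raw_prompt},
            ],
        ).choices[0].message.content

        # Pass 2: answer the rewritten prompt.
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": rewritten}],
        )
        return answer.choices[0].message.content

    print(ask_with_rewrite('How many "r"s are in "strawberry"?'))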