Comment by tikkun
1 year ago
When using a prompt that involves thinking first, all three get it correct.
"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."
Llama 405b: correct
Mistral Large 2: correct
Claude 3.5 Sonnet: correct
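A minimal sketch of how one might run that prompt, using the Anthropic Python SDK for the Claude 3.5 Sonnet case (the model id below is an assumption; the other two can be queried the same way through their own APIs):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    prompt = (
        "Count how many rs are in the word strawberry. First, list each letter "
        "and indicate whether it's an r and tally as you go, and then give a "
        "count at the end."
    )

    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model id; use whichever Sonnet you have
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)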
This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them, does A or B based on condition C, and then returns Y by applying transformation Z."
At that point it was easier to do it myself.
Exact instruction challenge https://www.youtube.com/watch?v=cDA3_5982h8
"What programming computers is really like."
EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.
It’s not impressive that one has to go to that length though.
IMO it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.
There are definitely tons of weaknesses with LLMs, but I continue to be impressed by what they do right, not upset at what they do wrong.
You can always find something to be unimpressed by, I suppose, but the fact that this was fixable with plain English is impressive enough to me.
The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.
*) that is, except sometimes by making adjustments to the system prompt
The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b confidently tell me it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up software features out of thin air when you ask "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.
Well, the answer is probably between 1 and 10, so if you try enough prompts I'm sure you'll find one that "works"...
> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"
To me it's just a limitation of the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know the spelling of some words. But they've never actually seen one, because their world is made up entirely of tokens. The word 'red' isn't r-e-d to them; it's more like a pictogram. Yet they know the spelling of strawberry, and they can identify an 'r' when it's on its own and count those, despite not being able to see the r's in the word itself.
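To make that concrete, here is a minimal sketch using the tiktoken library's cl100k_base encoding (an assumption; other models use other tokenizers, but the effect is the same):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode("strawberry")
    pieces = [enc.decode([t]) for t in token_ids]
    print(pieces)  # a handful of multi-letter chunks, not s-t-r-a-w-b-e-r-r-y
    # The model operates on those chunks, so the individual r's are never
    # directly visible to it.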
I think it's more that the question is not unlike "is there a double r in strawberry?" or "is the r in strawberry doubled?"
Even some people make this association, so it's no surprise that LLMs do.
The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.
To be fair, I just asked a real person and had to go to even greater lengths:
Me: How many "r"s are in strawberry?
Them: What?
Me: How many times does the letter "r" appear in the word "strawberry"?
Them: Is this some kind of trick question?
Me: No. Just literally, can you count the "r"s?
Them: Uh, one, two, three. Is that right?
Me: Yeah.
Them: Why are you asking me this?
You need to prime the other person with a system prompt that makes them compliant and obedient.
I look forward to the day when LLM refusal takes on a different meaning.
"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."
Try asking a young child...
Compared to chatbots of even five years ago, the answer of two is still mind-blowing.
This can be automated.
GPT-4o already does that: for problems involving math, it will write small Python programs to handle the calculations instead of doing them in the LLM itself.
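The generated helper is typically just a few lines, something along the lines of the sketch below (the exact code varies from run to run):

    word = "strawberry"
    count = sum(1 for ch in word.lower() if ch == "r")
    print(count)  # 3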
Appending "Think step-by-step" is enough to fix it for both Sonnet and LLama 3.1 70B.
For example, the latter model answered with:
To count the number of Rs in the word "strawberry", I'll break it down step by step:
1. Start with the individual letters: S-T-R-A-W-B-E-R-R-Y
2. Identify the letters that are "R": R (first one), R (second one), and R (third one)
3. Count the total number of Rs: 1 + 1 + 1 = 3
There are 3 Rs in the word "strawberry".
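A minimal sketch of the same approach, assuming the Ollama Python client with llama3.1:70b pulled locally (any chat API works the same way):

    import ollama

    question = 'How many "r"s are in the word "strawberry"?'
    response = ollama.chat(
        model="llama3.1:70b",
        messages=[{"role": "user", "content": question + " Think step-by-step."}],
    )
    print(response["message"]["content"])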
Chain-of-Thought (CoT) prompting to the rescue!
We should always put some effort into prompt engineering before dismissing the potential of generative AI.
Why doesn't the model prompt engineer itself?
Because that is a challenging task: you would need to define a prompt (or a set of prompts) that can reliably generate chain-of-thought prompts for the wide variety of problems the model encounters.
And sometimes CoT may not be the best approach. Depending on the problem, other prompt engineering techniques will perform better.
By this point, instruction tuning should include tuning the model to use chain of thought in the appropriate circumstances.
Can't you just instruct your LLM of choice to transform your prompts like this for you? Basically, feed it a bunch of heuristics that will help it better understand what you tell it.
Maybe the various chat interfaces already do this behind the scenes?
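As a sketch of that idea: chain two calls, one to rewrite the prompt and one to answer the rewritten version. The OpenAI client, model name, and rewriting heuristics below are stand-ins, not what any particular chat interface is known to do:

    from openai import OpenAI

    client = OpenAI()

    REWRITE_INSTRUCTIONS = (
        "Rewrite the user's prompt so a language model will answer it more "
        "reliably: ask for step-by-step reasoning, spell out any letter-level "
        "or arithmetic work, and request the final answer on its own line. "
        "Return only the rewritten prompt."
    )

    def ask_with_rewrite(raw_prompt: str) -> str:
        # Pass 1: have the model transform the prompt using the heuristics above.
        rewritten = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": REWRITE_INSTRUCTIONS},
                {"role": "user", "content": raw_prompt},
            ],
        ).choices[0].message.content

        # Pass 2: answer the rewritten prompt.
        answer = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": rewritten}],
        )
        return answer.choices[0].message.content

    print(ask_with_rewrite('How many "r"s are in "strawberry"?'))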