Comment by vessenes
2 days ago
I’ve read the paper and the skeptical comments here, to wit: it’s just an actor/critic pipeline by another name.
I’ll bite and say this is actually interesting — and the paper title is misleading.
What they’ve done here is hook up a text-only LLM to multimodal critics, give it (mostly) an image-diffusion generation task, and ask it to improve its prompting of that generator using the set of scores it gets back.
This definitely works, based on their outputs. Which is to say: LLMs can, zero shot, iteratively improve their prompting using only feedback from outside tools.
Why is this interesting? Well, this did not work in the GPT-3 era; it seems to do so now. I see this as an interesting line to be added in the ‘model capabilities’ box as our models get larger and more sophisticated — the LLMs can perform some sort of internally guided search against a black box generator and use a black box scorer to improve at inference time.
That’s pretty cool. It’s also generalizable, and I think it’s worth keeping in mind on the stack of possible approaches for, say, agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
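To make the shape of that loop concrete, here is a minimal sketch; every name in it is mine, not the paper’s API, and the LLM, diffusion model, and critics are just stand-ins for whatever you plug in:

```python
from typing import Any, Callable, Dict

# Minimal sketch of the loop described above (names are illustrative, not the
# paper's API): a text-only LLM revises a diffusion prompt using only the
# scalar scores returned by black-box multimodal critics.

def refine_prompt(
    propose: Callable[[str], str],                        # LLM: task -> first prompt
    revise: Callable[[str, str, Dict[str, float]], str],  # LLM: (task, prompt, scores) -> new prompt
    generate: Callable[[str], Any],                       # black-box diffusion model
    critics: Dict[str, Callable[[Any, str], float]],      # black-box scorers
    task: str,
    n_rounds: int = 5,
) -> str:
    prompt = propose(task)                                # zero-shot first attempt
    best_prompt, best_score = prompt, float("-inf")

    for _ in range(n_rounds):
        image = generate(prompt)
        scores = {name: critic(image, task) for name, critic in critics.items()}
        total = sum(scores.values())
        if total > best_score:
            best_prompt, best_score = prompt, total
        # The only feedback the LLM sees is its own prompt plus the scores:
        # no gradients, no pixels.
        prompt = revise(task, prompt, scores)

    return best_prompt
```

The structural point is that the LLM only ever sees its own prompt plus a handful of scalar scores; the generator and the scorers stay black boxes.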
> zero shot
I really wish we would find a different term for this.
Doing something always takes at least one attempt, i.e. "one shotting". "Zero shotting" is an oxymoron, which makes it a term that only creates more confusion rather than succinctly conveying something.
"One shot" is simply about the action itself, but it says nothing about how much preparation was done beforehand. "Zero shot" additionally implies without training or preparation.
TCGs have a related "zero turn win" concept, where the opponent goes first and you win without getting a turn due to the set of cards you randomly drew and being able to activate them on the opponent's turn.
I think of a shot as an example, not a try: “one shot” is “one example”, “zero shot” is “zero examples”. I don’t love it, but I don’t hate it; got a better word for it?
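Concretely, the count refers to in-context examples shown in the prompt, not to attempts. A toy illustration (wording is mine, borrowing the GPT-3 paper’s translation-style examples):

```python
# Toy illustration of the "shot = in-context example" convention.

task = "Translate English to French: cheese =>"

zero_shot = task  # zero examples in the prompt; the model still gets one try

one_shot = (
    "Translate English to French: sea otter => loutre de mer\n"
    + task
)  # one worked example precedes the query

few_shot = (
    "Translate English to French: sea otter => loutre de mer\n"
    "Translate English to French: peppermint => menthe poivrée\n"
    + task
)  # several examples; the count never refers to attempts
```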
We already have a term for it in people, "intuited". When you are asked to intuit something, it usually implies an unfamiliarity with the subject matter.
There is such entrenchment with terms, though, that it'll never get shifted to that... and on top of that, it doesn't sound as interesting and dynamic as "zero shotting".
I mean... how about "example"? I feel as if I were to give what you just said to someone a hundred years ago, with no context of AI training or even of this discussion, the very form of your response would lead them to the answer "example" ;P.
The issue with "shot" is that it is a term and part of an idiom that has been used for a very long time and, critically, is relevant to the same problem space in a much more intuitive way: to count the number of shots shot, not shots seen.
We say Sure Shot.
It's a shot from position zero
No it isn't. The number of shots (examples) is zero.
My favorite AI term to ridicule is the recent "Test Time Compute" nonsense, which has nothing whatsoever to do with testing. It literally just means "inference time".
And if I hear someone say "banger", "cooking", "insane", or "crazy" one more time, I'm going to sledgehammer my computer. Can't someone under 40 please pick up a book and read? Yesterday Sam Altman tried to coin "Skillsmaxxing" in a tweet. I threw my coffee cup at my laptop.
Speaking of old-timers and "inference time": there was a time when "inference" meant inferring parameters from data (i.e. training). And now it means "test time". (Or maybe the difference is statistics community vs. ML community.)
e.g. Bishop's textbook says:
> 5.2.4 Inference and decision
> We have broken the classification problem down into two separate stages, the inference stage in which we use training data to learn a model for p(Ck|x) and the subsequent decision stage in which we use these posterior probabilities to make optimal class assignments.
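In code, the two stages Bishop is separating look roughly like this (sklearn is used purely for brevity, and the cost matrix is made up):

```python
# Rough illustration of Bishop's two stages.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((200, 2))
y_train = (X_train[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

# Inference stage (Bishop's sense): use training data to learn a model of p(Ck|x).
model = LogisticRegression().fit(X_train, y_train)
posteriors = model.predict_proba(rng.random((5, 2)))   # p(Ck|x) for new points

# Decision stage: convert posteriors into class assignments, here by minimizing
# expected loss under an asymmetric cost matrix (rows = true class, columns = decision).
loss = np.array([[0.0, 1.0],
                 [5.0, 0.0]])   # missing class 1 is five times worse than a false alarm
decisions = np.argmin(posteriors @ loss, axis=1)
print(decisions)
```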
It makes quite a lot of sense juxtaposed with "train time compute". The point being made is that a set budget can be split between paying for more training or more inference _at test time_ or rather _at the time of testing_ the model. The word "time" in "inference time" plays a slightly different role grammatically (noun, not part of an adverbial phrase), but comes out to mean the same thing.
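To make the budget-splitting point concrete, here is a back-of-the-envelope sketch using the usual rough approximations (about 6·N·D FLOPs for training, about 2·N FLOPs per generated token at inference); every number is illustrative:

```python
# Back-of-the-envelope version of the train-time vs. test-time compute trade-off.
N = 7e9                      # model parameters (illustrative)
D = 2e12                     # training tokens (illustrative)
train_flops = 6 * N * D      # ~8.4e22 FLOPs for training

tokens_per_answer = 1_000
flops_per_answer = 2 * N * tokens_per_answer   # ~1.4e13 FLOPs per sampled answer

# Diverting 1% of the training budget to test-time compute buys roughly this
# many extra sampled answers (e.g. for best-of-n sampling or a critic loop):
extra_answers = 0.01 * train_flops / flops_per_answer
print(f"~{extra_answers:.1e} extra sampled answers")   # ~6.0e7 for these numbers
```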
"Get off my lawn" is alive and well, it seems.
Array indexing can start at 0 or 1.
For an array of zero shots, the indexing doesn’t matter.
> I think it’s worth keeping in mind on the stack of possible approaches for, say, agentic coding, that you can use a critic to not just ‘improve’ generated output, but most likely do some guided search through output space.
The one issue I keep finding with those approaches is that there are already good tools for the problem, but we keep chasing wasteful approaches for the sake of “natural language”, for something humans aren’t going to interact with well without a good deal of training anyway.
I do understand the hope of getting LLMs to do the bulk of the work, and then, after an audit, we fix the errors. But both the audit and the fixes will require the same mental energy as writing the code in the first place, and possibly more time.
Specialist tools are always more expressive and offer more control than tools aimed at the general public. Most agentic coding approaches offer a general interface instead of a specialized one, then redirect you to a bespoke and badly designed specialized interface whenever you want to do anything useful.
I hear that. Counterpoint: if all you have is a Phillips-head screwdriver, all you have is a Phillips-head screwdriver. On the other hand, if all you have is a six-axis CNC mill, well, then you have a lot.
I think of this less as a question of audit misses, and more as developing a permanently useful tool. As for open model weights: humanity will not (unless we’re talking real zombie-apocalypse scenarios) lose them. They are an incredible global asset, so making them more generally useful and figuring out how to use them is super helpful.
Maybe they are useful. But I think there’s more usefulness in specialized databases and optimized approaches than in betting everything on big LLMs. Kinda like deriving linting rules and combining them with a rule engine to catch errors: efficient and useful, instead of continuously running a big LLM.
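As a toy example of that kind of cheap, specialized check (my own sketch, not any particular linter’s implementation):

```python
# Toy lint rule: flag bare `except:` clauses via the AST. Runs in milliseconds,
# never needs a GPU, and catches the same class of mistake every single time.
import ast
import sys

def find_bare_excepts(source: str, filename: str = "<string>"):
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"{filename}:{node.lineno}: bare 'except:' swallows all errors")
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as f:
            for finding in find_bare_excepts(f.read(), path):
                print(finding)
```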
While it is hard to argue with the wisdom of crystallizing intellectual capital into our tools, I do wonder if these models might be as likely to diminish as to develop the person using them, in which case we trade an implement’s iterative improvement for our own, in a way.
Are they using the same diffusion models as the GPT-3 era? Meaning, is it the LLM that has improved, or is it the diffusion model? I know it's probably a foolish take, but I am really skeptical of the "larger models will solve all our problems" line of thinking.
They don’t compare in the paper. I will say I experimented extensively with GPT-3-era LLMs on improving output by trying to guide early diffusion models with critical responses. It was a) not successful, and b) pretty clear to me that GPT-3 didn’t “get” what it was supposed to be doing, or didn’t have enough context to keep all this in mind, or couldn’t process it properly, or some such thing.
This paper has ablations, although I didn’t read that section, so you could see where they say the effectiveness comes from. I bet, though, that it’s emergent from a bunch of different places.
FWIW, I don’t think LLMs will solve all our problems, so I too am skeptical of that claim. I’m not skeptical of the slightly weaker “larger models have emergent capabilities and we are probably not done finding them as we scale up”.
> FWIW, I don’t think LLMs will solve all our problems, so I too am skeptical of that claim. I’m not skeptical of the slightly weaker “larger models have emergent capabilities and we are probably not done finding them as we scale up”.
100% agree. I'd classify this as the time for identifying the limits of what they can functionally do, though, and it's a lot!