Comment by scribu
2 days ago
This seems to be a system to generate better prompts to be fed into a base multimodal model.
Interesting, but the title is definitely clickbait.
They only did that for image generation. The more interesting part is that an LLM can find, or closely approach, the correct caption for an image, video, or audio clip at test time, with no training, using only the score as a guide. It's essentially working blind, almost like the game Marco Polo, where the scorer says "warmer" or "colder" while the LLM finds its way towards the goal. This is an example of emergent capability, since there are no examples of this in the training data.
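To make the "Marco Polo" loop concrete, here is a minimal sketch of score-guided caption refinement. The helpers `propose_captions` (an LLM call that sees past attempts and their scores) and `score` (e.g. an image-text similarity model) are hypothetical stand-ins, not the paper's actual API.

```python
def refine_caption(image, propose_captions, score, rounds=10, k=4):
    """Search for a caption at test time using only a scalar score as feedback."""
    best_caption, best_score = "", float("-inf")
    history = []  # (caption, score) pairs fed back to the LLM as context

    for _ in range(rounds):
        # Ask the LLM for new candidates, conditioned on earlier attempts
        # and their scores (the "warmer"/"colder" signal).
        candidates = propose_captions(history, k=k)
        for caption in candidates:
            s = score(image, caption)
            history.append((caption, s))
            if s > best_score:
                best_caption, best_score = caption, s

    return best_caption, best_score
```

No gradients or training are involved; the only information flowing back to the model is the score attached to each previous guess.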
Actually, it's the name of the paper. And while the team also developed and released a system that elicits the behavior by doing what you described, it's entirely possible that the researchers considered what the title describes to be the most important finding in their work.
Exactly! There is definitely something wrong with FAIR.