Comment by zurichisstained
14 hours ago
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
];
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
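The playback side is roughly this (a minimal sketch; the tempo and the duration-to-seconds mapping, especially how 'triplet' is interpreted, are my assumptions, not something the models produce):

```
// Plays a melody array like the one above with the Web Audio API.
// freq === 0 is treated as a rest. Needs to run after a user gesture
// in most browsers, since AudioContext starts suspended otherwise.
const ctx = new AudioContext();

function play(melody, tempoBpm = 120) {
  const beat = 60 / tempoBpm;        // seconds per quarter note
  const durations = {                // assumed mapping from names to beats
    quarter: 1,
    half: 2,
    triplet: 1 / 3,                  // one third of a beat (assumption)
  };

  let t = ctx.currentTime;
  for (const note of melody) {
    const seconds = durations[note.duration] * beat;
    if (note.freq > 0) {
      const osc = ctx.createOscillator();
      const gain = ctx.createGain();
      osc.type = 'square';           // rough, vaguely guitar-ish timbre
      osc.frequency.value = note.freq;
      gain.gain.setValueAtTime(0.2, t);
      gain.gain.linearRampToValueAtTime(0, t + seconds); // fade to avoid clicks
      osc.connect(gain).connect(ctx.destination);
      osc.start(t);
      osc.stop(t + seconds);
    }
    t += seconds;                    // rests just advance the clock
  }
}

play(melody);
```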
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very open-ended prompt with no specific criteria to judge, and lately the SVGs all start to look similar, or at least like they accomplished the same non-goals (there's a pelican, there's a bicycle, and I'm not sure whether its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And with an LLM as the judge, the whole game becomes doubly uncertain and who knows what to think.
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt-based languages × models benchmark game, so we could say model X excels at Python fasta, etc. Although then the risk is that, again, it ends up in the training sets and the whole thing rigs itself.
I'm slightly confused by your example. What's the actual prompt? Is your expectation that a text model is going to know how to perform the exact song in audio?
Ohhh absolutely not, that would be pretty wild - I just wanted to see if it could understand musical notation enough to come up with the correct melody.
I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.
My naive guess is that all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).