Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (say this posted here the other day, pretty neat charts!)
It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.
Yes, sort of. Generally you can measure the pass rate on a benchmark given a fixed compute budget. A sufficiently smart model can hit a high pass rate with fewer tokens/compute. Check out the cost efficiency on https://artificialanalysis.ai/ (say this posted here the other day, pretty neat charts!)
Statistically. Do many trials and measure how often it succeeds/fails.
This is the only correct take. The only metric that matters is cost per desired outcome.
Repetition and statistics, if you have $1000++ you didn't need anyway.
It's much easier to measure a language model's intelligence than a human's because you can take as many samples as you want without affecting its knowledge. And we do measure human intelligence.