Comment by lern_too_spel

9 days ago

If you want that to get better, you need to produce a 3D model benchmark and popularize it. You can start with a pelican riding a bicycle, with a working bicycle.

I am building pretty much the same product as OP, and have a pretty good harness to test LLMs. In fact, I have already run tons of tests. It’s currently aimed at my own internal testing, but making something that is easier to digest should be a breeze. If you are curious: https://grandpacad.com/evals

Building a benchmark is a great idea, thanks. Maybe I will have a couple of days to spend on this soon.