
Comment by ponyous

10 hours ago

Just ran and scored 63 3D model generations (via code) across high and no reasoning. A 3D modeling benchmark quickly exposes a model's spatial, logical, and coding performance, so I think it's a very good indicator of overall quality.

Here are the results compared to Gemini 3.5 Flash:

    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%

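Although it is cheaper with reasoning off ($0.10 vs. $0.18 per generation), it is significantly slower in both configurations (roughly 2x to 4x the median time), and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the models they are worse.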
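A minimal sketch of how columns like the ones in the table could be aggregated from per-generation records. The `Run` fields, the function names, and the reading of "Quality" as a relative change in mean adherence against the baseline are all assumptions for illustration, not the commenter's actual harness.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Run:
    """One generation attempt (hypothetical record layout)."""
    code_errors: int   # failed script executions before a usable result
    cost_usd: float    # API cost of this generation
    seconds: float     # wall-clock duration
    adherence: float   # human-assigned score in [0, 1]


def summarize(runs):
    """Collapse a list of runs into the per-config columns."""
    return {
        "code_err_per_gen": mean(r.code_errors for r in runs),
        "cost_per_gen": mean(r.cost_usd for r in runs),
        "median_time_s": median(r.seconds for r in runs),
        "mean_adherence": mean(r.adherence for r in runs),
    }


def quality_delta_pct(candidate, baseline):
    """Relative change in mean adherence vs. the baseline (assumed definition)."""
    base = baseline["mean_adherence"]
    return 100.0 * (candidate["mean_adherence"] - base) / base
```

Median is used for time because a few very long agent runs would otherwise dominate a mean.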

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open-source SOTA model, by far.

Very interested in this! Can you share more about the modelling method (e.g., three.js?), the task list, and the outputs here?

I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like:

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".
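The steps above can be sketched roughly as follows. Everything here is hypothetical scaffolding: `generate_model`, `render_views`, and `identify` are placeholder callables standing in for the code-generating model, a renderer, and a third-party vision model, and the substring match is just one simple way to grade a "what is this" answer.

```python
from statistics import mean


def evaluate_task(prompt, expected_label, generate_model, render_views,
                  identify, angles=(0, 90, 180, 270)):
    """Score one task as the fraction of views the vision model recognizes."""
    model = generate_model(prompt)            # 1. give the 3D modelling task
    images = render_views(model, angles)      # 2. render snapshots from several angles
    guesses = [identify(img) for img in images]  # 3. ask "what is this?" per view
    hits = sum(expected_label.lower() in g.lower() for g in guesses)
    return hits / len(guesses)                # 4. end-to-end accuracy for this task


def evaluate_suite(tasks, generate_model, render_views, identify):
    """Average the per-task accuracy over (prompt, expected_label) pairs."""
    return mean(
        evaluate_task(prompt, label, generate_model, render_views, identify)
        for prompt, label in tasks
    )
```

Passing the three stages in as callables keeps the grader independent of any particular model API or renderer, so the same loop can be rerun when either is swapped.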

  • I don't have the eval results live yet, so I can't share them.

    I was benchmarking using a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent with access to tools that execute build123d scripts, get sculpted models, use Blender to combine sculpts with parametric models, inspect the model (visually and with code), search datasheets, ...

    I tried what you're recommending a while ago (asking an AI to evaluate from different angles) and the AI evaluations were extremely bad, with barely any correlation to my own scores. Things have gotten better, but I don't trust it enough yet.

    Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just return a boolean pass/fail):

        <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
        <0.4 → Weak – Partially relevant; significant omissions or errors.
        <0.6 → Fair – Covers main points but lacks completeness or precision.
        <0.8 → Good – Mostly accurate; minor gaps or deviations.
        <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
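
A minimal sketch of the banding above as a lookup, assuming scores are floats in [0, 1]. The function and constant names are made up for illustration. Note that each band's upper bound is exclusive, so a score of exactly 0.8 lands in "Excellent".

```python
# Upper bounds are exclusive, matching the "<" thresholds in the rubric.
BANDS = [
    (0.2, "Poor"),
    (0.4, "Weak"),
    (0.6, "Fair"),
    (0.8, "Good"),
]


def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    for upper, label in BANDS:
        if score < upper:
            return label
    return "Excellent"  # everything from 0.8 up to and including 1.0
```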
    

    Here is the scenario list (prompts are much more detailed):

        dragon-bottle-stopper
        editing-param-mid-conv
        editing-parametric-enclosure
        editing-swap-material-param
        editing-text-edit-cube
        multi-turn-bird-house
        multi-turn-dice-tower
        multi-turn-modular-planter
        multi-turn-phone-stand
        multi-turn-shelf
        one-shot-bookend
        one-shot-cable-clip
        one-shot-chess-queen
        one-shot-coaster
        one-shot-coffee-cup
        one-shot-dog-tag
        one-shot-dragon-figurine
        one-shot-hex-bracket
        one-shot-keychain-fob
        one-shot-low-poly-tree
        one-shot-pegboard-hook
        one-shot-pi4-case
        one-shot-threaded-jar
    
    
    

    [0]: https://grandpacad.com

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

  • Absolutely. Running it now, will update this comment in about 30 mins.

    Edit: Surprisingly good results from 3.0 Flash with high thinking.

    Cost: $0.06

    Duration: 3.22 min

    Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)

    Adherence was on par with 3.5 Flash at low thinking.

    • Thanks! I’ve still been using 3.0 a lot; the price-to-performance ratio absolutely kills compared to Google’s other, newer offerings.