Updated LLM Benchmark (Gemini 3 Flash)

15 days ago (entropicthoughts.com)

Good article. So this is sort of a tangent, but here's a bit of advice as someone who makes heavy use of GenAI imagery in the service of articles that I write.

Never use out-of-the-box images of CRT computers - 99% of the time the keyboards are an ergonomic train wreck and the text on the screen is a smeary, blurry mess.

See the image here for a good example:

https://mordenstar.com/blog/win9x-hacks

This is a GenAI image from NB Pro layered with a loop of the Win95 start sequence and combined into a single animated GIF. Notice I sidestepped the inclusion of a keyboard altogether.

Now, more topically: since the actual list of IF games doesn't appear to be a secret, I think it would have been better to feature it more prominently in the article, rather than tucking it away in a side note in the footer.

  • > Never use out-of-the-box images of CRT computers

    Thanks for the feedback! I'm very new to GenAI imagery and still finding my feet.

    Seeing the results, I definitely considered compositing a real photograph of a computer with the rest of the landscape, but ended up deciding against it on account of a lack of time.

    • Cool - GenAI image generation is a deep rabbit hole that you're about to fall into!

      Also, super happy that you pitted LLMs against relatively recent IF to mitigate cheating through pre-existing training data.

      FYI, I've been running a SOTA model comparison site for about a year now that looks at prompt adherence across local (Qwen-Image, Flux) vs. proprietary (NB Pro, Seedream) models. It might help give an idea of where the capabilities are today.

      https://genai-showdown.specr.net
