← Back to context

Comment by ugh123

3 months ago

Cool, and marginally informative on the current state of things. but kind of a waste of energy given everything is re-done every minute to compare. We'd probably only need a handful of each to see the meaningful differences.

It's actually quite fascinating if you watch it for 5 minutes. Some models are overall bad, but others nail it in one minute and butcher it in the next.

It's perhaps the best example I have seen of model drift driven by just small, seemingly unimportant changes to the prompt.

  • > model drift driven by just small, seemingly unimportant changes to the prompt

    What changes to the prompt are you referring to?

    According the comment on the site, the prompt is the following:

    Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.

    The prompt doesn't seem to change.

    • The time given to the model. So the difference between two generations is just somethng trivially different like: "12:35" vs 12:36"

    • presumably the time is replaced with the actual current time at each generation. I wonder if they are actually generated every minute or if all 6480 permutations (720 minutes in a day * 9 llms) were generated and just show on a schedule

  • It is really interesting to watch them for a while. QWEN keeps outputting some really abstract interpretations of a clock, KIMI is consistently very good, GPT5's results line up exactly with my experience with its code output (overly complex and never working correctly)

  • We can't know how much is about the prompt though and how much is just stochastic randomness in the behavior of that model on that prompt, right? I mean, even given identical prompts, even at temp 0, models don't always behave identically.... at least, as far as I know? Some of the reasons why are I think still a research question, but I think its a fact nonetheless.

  • Kimi seems the only reliable one which is a bit surprising, and GPT 4o is consistently better than GPT 5 which on the other hand is unfortunately not surprising at all.

I sort of assumed they cached like 30 inferences and just repeat them, but maybe I'm being too cynical.

The energy usage is minuscule.

  • It's wasteful. If someone built a clock out of 47 microservices that called out to 193 APIs to check the current time, location, time zone, and preferred display format we'd rightfully criticize it for similar reasons.

    In a world where Javascript and Electron are still getting (again, rightfully) skewered for inefficiency despite often exceeding the performance of many compiled languages, we should not dismiss the discussion around efficiency so easily.

    • Yes it is wasteful.

      But I presume you light up Christmas lights in December, drive to the theater to watch a movie or fire up a campfire on holiday. That too is "wasteful". It's not needed, other, or far more efficient ways exist to achieve the same. And in absolute numbers, far more energy intensive than running an LLM to create 9 clocks every minute. We do things to learn, have fun, be weird, make art, or just spend time.

      Now, if Rolex starts building watches by running an LLM to drive its production machines or if we replace millions of wall clocks with ones that "Run an LLM every second", then sure, the waste is an actual problem.

      Point I'm trying to make is that it's OK to consider or debate the energy use of LLMs compared to alternatives. But that bringing up that debate in a context where someone is creative, or having a fun time, its not, IMO. Because a lot of "fun" activities use a lot of energy, and that too isn't automatically "wasteful".

    • What I find amusing with this argument is that, no one ever brought power savings when e.g. used "let me google that for you" instead of giving someone the answer to their question, because we saw the utility of teaching others how to Google. But apparently we can't see the utility of measuring the oversold competence of current AI models, given sufficiently large sampling size.

    • Let's do some math.

      60x24x30 = 40k AI calls per month per model. Let's suppose there are 1000 output tokens (might it be 10k tokens? Seems like a lot for this task). So 40m tokens per model.

      The price for 1m output tokens[0] ranges from $.10 (qwen-2.5) to $60 (GPT-4). So $4/mo for the cheapest, and $2.5k/mo for the most expensive.

      So this might cost several thousand dollars a month? Something smells funny. But you're right, throttling it to once an hour would achieve a similar goal and likely cost less than $100/mo (which is still more than I would spend on a project like this).

      [0] https://pricepertoken.com/

      1 reply →