Comment by lanewinfield

3 months ago

hi, I made this. thank you for posting.

I love clocks and I love finding the edges of what any given technology is capable of.

I've watched this for many hours. Kimi frequently produces the most accurate clock, but it also shows the least variation and is the most boring. Qwen is often the most insane and makes me laugh. Which one is "better"?

jdietrich 3 months ago

Clock drawing is widely used as a test for assessing dementia. Sometimes the LLMs fail in ways that are fairly predictable if you're familiar with CSS and typical shortcomings of LLMs, but sometimes they fail in ways that are less obvious from a technical perspective but are exactly the same failure modes as cognitively-impaired humans.

I think you might have stumbled upon something surprisingly profound.

https://www.psychdb.com/cognitive-testing/clock-drawing-test

overfeed 3 months ago
> Clock drawing is widely used as a test for assessing dementia
Interestingly, clocks are also an easy tell for when you're dreaming, if you're a lucid dreamer; they never work normally in dreams.
- ghurtado 3 months ago
  
  In lucid dreams there's a whole category of things like this: reading a paragraph of text, looking at a clock (digital or analog), or working any kind of technology more complex than a calculator.
  For me personally, even light switches have been a huge tell in the past, so basically almost anything electrical.
  I've always held the utterly unscientific position that this is because the brain only has enough GPU cycles to show you an approximation of what the dream world looks like, but to actually run a whole simulation behind the scenes would require more FLOPs than it has available. After all, the brain also needs to run the "player" threads: It's already super busy.
  Stretching the analogy past the point of absurdity, this is a bit like modern video game optimizations: the mountains in the distance are just a painting on a surface, and the remote on that couch is just a messy blur of pixels when you look at it up close.
  So the dreaming brain is like a very clever video game developer, I guess.
- danw1979 3 months ago
  
  For me it’s phones… specifically dialling a number manually. No matter how carefully I dial, the number on the screen is rarely correct.
- biztos 3 months ago
  
  Do they look normal but just not work normally?
  Maybe reality is a world of broken clocks, and they only “work” in the simulation.
- teaearlgraycold 3 months ago
  
  I feel like the heuristic could just be - do I feel like I’m in a dream? Then I am. I’ve never felt that way when awake.
xrisk 3 months ago

Maybe explainable via the fact that these tests are part of the LLM training set?
jorgesborges 3 months ago

Conceptual deficit is a great failure mode description. The inability to retrieve "meaning" about the clock -- having some understanding about its shape and function but not its intent to convey time to us -- is familiar from a lot of bad LLM output.
BHSPitMonkey 3 months ago

I would think the way humans draw clocks has more in common with image generation models (which probably do a bit better with this task overall) than a language model producing SVG markup, though.
ACCount37 3 months ago
LLMs don't do this because they have "people with dementia draw clocks that way" in their data. They do it because they're similar enough to human minds in function that they often fail in similar ways.
An amusing pattern that dates back to "1kg of steel is heavier of course" in GPT-3.5.
- kaffekaka 3 months ago
  
  How do you know this?
  Obviously, humans failing in these ways ARE in the training set. So it should definitely affect LLM output.
TheJoeMan 3 months ago
Figure 6 with the square clock would be a cool modern art piece.
- yencabulator 3 months ago
  
  I have had this thought of a slow-moving mechanical simulation of a chaotic triple pendulum as a clock hand for a very long time..
  Or maybe something like https://www.youtube.com/watch?v=dhZxdV2naw8

bspammer 3 months ago

If you're keeping all the generated clocks in a database, I'd love to see a Facemash style spin-off website where users pick the best clock between two options, with a leaderboard. I want to know what the best clock Qwen ever made was!

abixb 3 months ago
We might be on to creating a new crowd-ranked LLM benchmark here.
- addandsubtract 3 months ago
  
  A pelican wearing a working watch
nightpool 3 months ago

Yes! Please do this
layer8 3 months ago

Not the best, but the most amusing.

smusamashah 3 months ago

Please make it show the last 5 (or some other number of) clocks for each model. It would be nice to see the deviation and variety for each model at a glance.

charliewallace 3 months ago

Very cool! I also love clocks, especially weird ones, and recently put up this 3D Moebius Strip clock, hope you like it: https://www.mobiusclock.com

chemotaxis 3 months ago

This is honestly the best thing I've seen on HN this month. It's stupid, enlightening... funny and profound at the same time. I have a strong temptation to pick some of these designs and build them in real life.

I applaud you for spending money to get it done.

AnonHP 3 months ago

Could you please change and adjust the positions of the titles (like GPT 5)? On Firefox Focus on iOS, the spacing is inconsistent (seems like it moves due to the space taken by the clock). After one or two of them, I had to scroll all the way down to the bottom and come back up to understand which title is linked to which clock.

anigbrowl 3 months ago

I really like this. The broken ones are sometimes just failures, but sometimes provide intriguing new design ideas.

jdiff 3 months ago
This same principle is why my favorite image generation models are the earlier ones from 2019-2020, which could only reliably generate soup. They're like Rorschach tests: it's not about what's there, it's about what you see in them. I don't want a bot to make art for me; sometimes I just want some shroom-induced inspirational smears.
- nemomarx 3 months ago
  
  I really miss that deepdream aesthetic with the dogs' eyes popping up everywhere.

ks2048 3 months ago

Nice job! Maybe let users click an example to see the raw source (LLM output)

brianjking 3 months ago

This is an awesome benchmark. Officially one of my favorites now. Thank you for making this.

csours 3 months ago

LOVE IT!

It would be really cool if I could zoom out and have everything scale properly!

Fabricio20 3 months ago

Why is this different per user? I sent this to a few friends and they all see different things from what I'm seeing, for the same time?

samtheprogram 3 months ago
It regenerates on page load. I find that pretty useful.
Grok 4 and Kimi nailed it the first time for me, then only Kimi on the second pass.
- malfist 3 months ago
  
  Not on page load, it regenerates every minute. There's a little hovering question mark in the top right that explains things, including the prompt to the models.
layer8 3 months ago

It’s different per minute, not per user.

hakcermani 3 months ago

Would you mind sharing the prompt, in a gist perhaps?

ceroxylon 3 months ago

They have it available on the site under the (?) button:
"Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting."
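
For anyone who wants to judge whether a generated clock is actually showing the right time, the expected hand positions are simple arithmetic. Below is a minimal sketch in Python; `hand_angles` is a hypothetical helper written for illustration and is not taken from the site. It assumes angles are measured in degrees clockwise from 12 o'clock, which is what a CSS `rotate()` on a hand pointing straight up would need.

```python
def hand_angles(hour: int, minute: int, second: int = 0) -> tuple[float, float, float]:
    """Return (hour, minute, second) hand angles in degrees, clockwise from 12.

    Hypothetical helper for illustration only; not part of the site.
    """
    # Second hand: 360 degrees / 60 seconds = 6 degrees per second.
    second_angle = second * 6.0
    # Minute hand: 6 degrees per minute, plus 0.1 degree per elapsed second.
    minute_angle = minute * 6.0 + second * 0.1
    # Hour hand: 30 degrees per hour, plus 0.5 degree per elapsed minute
    # and a small contribution from the elapsed seconds.
    hour_angle = (hour % 12) * 30.0 + minute * 0.5 + second * (0.5 / 60)
    return hour_angle, minute_angle, second_angle


if __name__ == "__main__":
    h, m, s = hand_angles(10, 10)
    print(f"hour: rotate({h}deg), minute: rotate({m}deg), second: rotate({s}deg)")
```

A common failure to look for is an hour hand that snaps to the whole hour (ignoring the `minute * 0.5` term), which leaves the clock looking plausible but subtly wrong.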