
Comment by jdietrich

3 months ago

Clock drawing is widely used as a test for assessing dementia. Sometimes the LLMs fail in ways that are fairly predictable if you're familiar with CSS and the typical shortcomings of LLMs, but sometimes they fail in ways that are less obvious from a technical perspective yet are exactly the failure modes of cognitively impaired humans.

I think you might have stumbled upon something surprisingly profound.

https://www.psychdb.com/cognitive-testing/clock-drawing-test

> Clock drawing is widely used as a test for assessing dementia

Interestingly, clocks are also an easy tell for when you're dreaming, if you're a lucid dreamer; they never work normally in dreams.

  • In lucid dreams there's a whole category of things like this: reading a paragraph of text, looking at a clock (digital or analog), or working any kind of technology more complex than a calculator.

    For me personally, even light switches have been a huge tell in the past, so basically almost anything electrical.

    I've always held the utterly unscientific position that this is because the brain only has enough GPU cycles to show you an approximation of what the dream world looks like, but to actually run a whole simulation behind the scenes would require more FLOPs than it has available. After all, the brain also needs to run the "player" threads: It's already super busy.

    Stretching the analogy past the point of absurdity, this is a bit like modern video game optimizations: the mountains in the distance are just a painting on a surface, and the remote on that couch is just a messy blur of pixels when you look at it up close.

    So the dreaming brain is like a very clever video game developer, I guess.

  • For me it’s phones… specifically dialling a number manually. No matter how carefully I dial, the number on the screen is rarely correct.

    • It seems that I've been stuck in a lucid dream for a couple of decades; no matter how carefully I write text on a phone keyboard, it never comes out as intended.

  • Do they look normal but just not work normally?

    Maybe reality is a world of broken clocks, and they only “work” in the simulation.

  • I feel like the heuristic could just be - do I feel like I’m in a dream? Then I am. I’ve never felt that way when awake.

Maybe this is explainable by the fact that these tests are part of the LLM training set?

"Conceptual deficit" is a great description of this failure mode. The inability to retrieve "meaning" about the clock -- having some understanding of its shape and function, but not of its intent to convey time to us -- is familiar from a lot of bad LLM output.

I would think the way humans draw clocks has more in common with image generation models (which probably do a bit better at this task overall) than with a language model producing SVG markup, though.
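To make that concrete, here's a rough sketch of what the markup version of the task demands (the time and coordinates below are illustrative, not taken from the article). The one genuinely conceptual step is mapping the time to hand angles -- the hour hand sits at 30° per hour plus 0.5° per minute, the minute hand at 6° per minute -- and everything else is boilerplate geometry:

    <svg viewBox="0 0 200 200" xmlns="http://www.w3.org/2000/svg">
      <!-- clock face -->
      <circle cx="100" cy="100" r="95" fill="none" stroke="black" stroke-width="2"/>
      <!-- hands are drawn pointing at 12, then rotated about the centre -->
      <!-- hour hand for 10:10: (10 * 30) + (10 * 0.5) = 305 degrees -->
      <line x1="100" y1="100" x2="100" y2="55" stroke="black" stroke-width="4"
            transform="rotate(305 100 100)"/>
      <!-- minute hand for 10:10: 10 * 6 = 60 degrees -->
      <line x1="100" y1="100" x2="100" y2="25" stroke="black" stroke-width="2"
            transform="rotate(60 100 100)"/>
    </svg>

The characteristically "conceptual" failure is producing perfectly valid markup like this but pointing the minute hand at the numeral 10 (300°) instead of at the 2 (60°): the shape of a clock without the meaning of its hands.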

LLMs don't do this because they have "people with dementia draw clocks that way" in their data. They do it because they're similar enough to human minds in function that they often fail in similar ways.

An amusing pattern that dates back to "1kg of steel is heavier of course" in GPT-3.5.

  • How do you know this?

    Obviously, humans failing in these ways ARE in the training set. So it should definitely affect LLM output.

    • First: generalization. The failure modes extend to unseen tasks. That specific way of failing at "1kg of steel" sure was in the training data, but novel closed-set logic puzzles couldn't have been, and they display similar failures. The same "vibe-based reasoning" process of "steel has heavy vibes, feather has light vibes, thus steel is heavier" produces other similar failures.

      Second: the failures go away with capability (raw scale, reasoning training, test-time compute), on seen and unseen tasks both. Which is a strong hint that the model was truly failing, rather than being capable of doing a task but choosing to faithfully imitate a human failure instead.

      I don't think the influence of human failures in the training data on the LLMs is nil, but it's not just a surface-level failure repetition behavior.