Comment by IAmGraydon

3 months ago

That's because they literally cannot do that. Doing what you're asking requires an understanding of why the numbers on the clock face are where they are, and of what it would mean if there were an extra hour on the clock (i.e., that you would have to divide 360° by 13 to begin to work out where the numbers go). AI models have no concept of anything that's not included in their training data. Yet people continue to anthropomorphize this technology and are surprised when it becomes obvious that it's not actually thinking.
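For concreteness, here is the arithmetic the comment alludes to: the angular spacing between adjacent numerals on a standard 12-hour face versus a hypothetical 13-hour face.

$$
\Delta\theta_{12} = \frac{360^\circ}{12} = 30^\circ,
\qquad
\Delta\theta_{13} = \frac{360^\circ}{13} \approx 27.69^\circ
$$

So on a 13-hour face every numeral after the top one shifts slightly counterclockwise relative to its 12-hour position, and numeral $k$ sits at $k \cdot 360^\circ/13$ measured clockwise from the top.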

13 comments


energy123 3 months ago

The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.

Put another way, it was hoped that once the dataset got rich enough, developing this understanding would actually be more efficient for the neural network than memorizing the training data.

The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.

Or more realistically, put in the creative and difficult research work required to discover the answer to that question.

bobbylarrybobby 3 months ago

It's interesting because if you asked them to write code to generate an SVG of a clock, they'd probably use a loop from 1 to 12, using sin and cos of the angle (given by the loop index divided by 12, times 2π) to place the numerals. They know how to do this, so they basically understand the process that generates a clock face. And extrapolating from that to 13 hours is trivial (for a human). So the fact that they can't do this extrapolation on their own is very odd.
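A minimal sketch of the kind of code described above, assuming a plain SVG output and angles measured clockwise from the top of the face. The function names `numeral_positions` and `clock_svg` are hypothetical illustrations, not taken from any model's actual output. The point is that going from 12 to 13 hours changes a single argument.

```python
import math


def numeral_positions(hours=12, radius=100.0, cx=0.0, cy=0.0):
    """Return (label, x, y) for each numeral on an `hours`-hour clock face.

    Angles are measured clockwise from the top (12 o'clock on a normal clock).
    SVG's y axis points down, hence the minus sign on the cosine term.
    """
    positions = []
    for i in range(1, hours + 1):
        angle = 2 * math.pi * i / hours  # fraction of a full turn, in radians
        x = cx + radius * math.sin(angle)
        y = cy - radius * math.cos(angle)
        positions.append((str(i), x, y))
    return positions


def clock_svg(hours=12, size=240):
    """Build a bare-bones SVG clock face (outline plus numerals only)."""
    c = size / 2          # centre of the face
    r = size * 0.4        # radius at which numerals are placed
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">',
        f'<circle cx="{c}" cy="{c}" r="{size * 0.48}" fill="none" stroke="black"/>',
    ]
    for label, x, y in numeral_positions(hours, r, c, c):
        parts.append(
            f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
            f'dominant-baseline="middle">{label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    # The "extrapolation" is just changing 12 to 13.
    print(clock_svg(hours=13))
```

With `hours=13`, the last numeral lands at the top of the face and the other twelve are spaced 360°/13 apart, which is exactly the generalization the comment says the image models fail to make.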

echelon 3 months ago

gpt-image-1 and Google Imagen understand prompts, they just don't have training data to cover these use cases.

gpt-image-1 and Imagen are wickedly smart.

The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

phkahler 3 months ago
>> The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.
That's great, but I bet it can't tie its own shoes.
- echelon 3 months ago
  
  No, but I can get it to do a lot of work.
  It's a part of my daily toolbox.
- esafak 3 months ago
  
  And a submarine can't swim. Big deal.

ryandrake 3 months ago

I wonder if you would have more success if you painstakingly described the shape and features of a clock in great detail but never used the words clock or time or anything that might give the AI the hint that they were supposed to output something like a clock.

BrandoElFollito 3 months ago

And this is a problem for me. I guess that it would work, but as soon as the word "clock" appears, the request goes out the window because a clock HAS.12.HOURS.
I use this a lot in cybersecurity when I need to do something "illegal". I am refused help, until I say that I am doing research on cybersecurity. In that case no problem.

Workaccount2 3 months ago

The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.

chemotaxis 3 months ago
I wonder if it's because we mean different things by generalization.
For text, "generalization" is still "generate text that conforms to all the usual rules of the language". For images of 13-hour clock faces, we're explicitly asking the LLM to violate the inferred rules of the universe.
I think a good analogy would be asking an LLM to write in English, except the word "the" now means "purple". They will struggle to adhere to this prompt in a conversation.
- Workaccount2 3 months ago
  
  That's true, but I think humans would stumble a lot too (try reading old printed text from the 18fh cenfury where fhey used "f" insfead of t in prinf, if's a real frick fo gef frough).
  However, humans are pretty adept at discerning images, even ones outside the norm. I really think there is some kind of architectural block hampering transformers' ability to really "see" images. For instance, if you show any model a picture of a dog with 5 legs (a fifth leg photoshopped onto its belly), they all say there are only 4 legs, and will argue with you about it. Hell, GPT-5 even wrote a leg detection script in Python (impressive) which detected the 5 legs, and then it said the script was bugged and modified the parameters until one of the legs wasn't detected, lol.
  

godelski 3 months ago

Yes, the problem is that these so-called "world models" do not actually contain a model of the world, or of any world.