Comment by baltimore

3 months ago

Since the first (good) image generation models became available, I've been trying to get them to generate an image of a clock with 13 instead of the usual 12 hour divisions. I have not been successful. Usually they will just replace the "12" with a "13" and/or mess up the clock face in some other way.

I'd be interested if anyone else is successful. Share how you did it!

I've noticed that image models are particularly bad at modifying popular concepts in novel ways (way worse "generalization" than what I observe in language models).

  • Maybe LLMs always fail to generalize outside their data set, and it’s just less noticeable with written language.

    • This is it. They’re language models which predict next tokens probabilistically and a sampler picks one according to the desired "temperature". Any generalization outside their data set is an artifact of random sampling: happenstance and circumstance, not genuine substance.

    • Most image models are diffusion models, not LLMs, and have a bunch of other idiosyncrasies.

      So I suspect it's more that lessons from diffusion image models don't carry over to text LLMs.

      And the image models that are based on multimodal LLMs (like Nano Banana) seem to do a lot better at novel concepts.

    • They definitely don't completely fail to generalise. You can easily prove that by asking them something completely novel.

      Do you mean that LLMs might display a similar tendency to modify popular concepts? If so that definitely might be the case and would be fairly easy to test.

      Something like "tell me the lord's prayer but it's our mother instead of our father", or maybe "write a haiku but with 5 syllables on every line"?

      Let me try those ... nah ChatGPT nailed them both. Feels like it's particular to image generation.

  • Also, they're fundamentally bad at math. They can draw a clock because they've seen clocks, but going further requires some calculations they can't do.

    For example, try asking Nano Banana to do something simpler, like "draw a picture of 13 circles." It likely will not work.

  Generate an image of a clock face, but instead of the usual 12 hour numbering, number it with 13 hours. 

Gemini 2.5 Flash, or "Nano Banana", or whatever we're calling it these days. https://imgur.com/a/1sSeFX7

A normal (ish) 12h clock. It numbered it twice, in two concentric rings. The outer ring is normal, but the inner ring numbers the 4th hour as "IIII" (fine, and a thing that clocks do) and the 8th hour as "VIIII" (wtf).

  • It should be pretty clear already that anything based on (limited to?) communicating in words/text can never grasp conceptual thinking.

    We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

    • > We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

      We have a very comprehensive and precise spec for that [0].

      If you don't want to hop through the certificate warning, here's the transcript:

      - Some day, we won't even need coders any more. We'll be able to just write the specification and the program will write itself.

      - Oh wow, you're right! We'll be able to write a comprehensive and precise spec and bam, we won't need programmers any more.

      - Exactly

      - And do you know the industry term for a project specification that is comprehensive and precise enough to generate a program?

      - Uh... no...

      - Code, it's called code.

      [0]: https://www.commitstrip.com/en/2016/08/25/a-very-comprehensi...

    • I don’t think that’s clear at all. In fact the proficiency of LLMs at a wide variety of tasks would seem to indicate that language is a highly efficient encoding of human thought, much more so than people used to think.

I gave this "riddle" to various models:

> The farmer and the goat are going to the river. They look into the sky and see three clouds shaped like: a wolf, a cabbage and a boat that can carry the farmer and one item. How can they safely cross the river?

Most of them just give the answer to the well-known river-crossing riddle. Some "feel" that something is off, but still have a hard time figuring out that the wolf, the boat, and the cabbage are just clouds.

That's just a patch to the training data.

Once companies see this starting to show up in the evals and criticisms, they'll go out of their way to fix it.

This is really cool. I tried to prompt Gemini, but every time I got the same picture. I don't know how to share a session (the way you can with ChatGPT), but the prompts were:

If a clock had 13 hours, what would be the angle between two of these 13 hours?

Generate an image of such a clock

No, I want the clock to have 13 distinct hours, with the angle between them as you calculated above

This is the same image. There need to be 13 hour marks around the dial, evenly spaced

... And its last answer was

You are absolutely right, my apologies. It seems I made an error and generated the same image again. I will correct that immediately.

Here is an image of a clock face with 13 distinct hour marks, evenly spaced around the dial, reflecting the angle we calculated.

And the very same clock, with 12 hours, and a 13th above the 12...

  • This is probably my biggest problem with AI tools, having played around with them more lately.

    "You're absolutely right! I made a mistake. I have now comprehensively solved this problem. Here is the corrected output: [totally incorrect output]."

    None of them ever seem to have the ability to say "I cannot seem to do this" or "I am uncertain if this is correct, confidence level 25%." The only time they will give up or refuse to do something is when they are deliberately programmed to censor for often dubious "AI safety" reasons. All other times, they come back again and again with extreme confidence while producing total garbage output.

  • You can click the share icon (the two-way branch icon; it doesn't look like Apple's share icon) under the image it generates to share the conversation.

    I'm curious if the clock image it was giving you was the same one it was giving me:

    https://gemini.google.com/share/780db71cfb73

    • Thanks for the tip about sharing!

      No, my clock was an old style one, to be put on a shelf. But at least it had a "13" proudly right above the "12" :)

      This reminds me of my kids when they were in kindergarten, bringing home art that needed extra explanation before you could tell what it was. But they were very proud!

I was able to have AI generate an image like this, but not via diffusion/autoregression; I had it write Python code to create the image.

ChatGPT made a nice-looking clock with matplotlib, though the code had some bugs it had to fix (the hours ran counter-clockwise). Gemini produced correct code one-shot; it used Pillow instead of matplotlib, but the result didn't look as nice.
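
For reference, here is a minimal sketch along those lines (my own, not the exact code either model produced): a matplotlib script that places the numerals by looping over the hour count, so that 12 vs. 13 hours is a single constant.

    # Minimal sketch (not either model's actual output): draw a clock face
    # with HOURS divisions by spacing the numerals 360/HOURS degrees apart.
    import numpy as np
    import matplotlib.pyplot as plt

    HOURS = 13  # set to 12 for a normal clock

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.set_aspect("equal")
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))

    for i in range(1, HOURS + 1):
        # Hour HOURS sits at the top; the numerals run clockwise from there.
        theta = np.pi / 2 - 2 * np.pi * i / HOURS
        ax.text(0.82 * np.cos(theta), 0.82 * np.sin(theta), str(i),
                ha="center", va="center", fontsize=14)
        ax.plot([0.92 * np.cos(theta), np.cos(theta)],
                [0.92 * np.sin(theta), np.sin(theta)], color="black")

    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.axis("off")
    fig.savefig("clock13.png")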

Weird, I never tried that. I tried all the usual tricks that usually work, including swearing at the model (this works scarily well with LLMs), and got nothing. I even tried going the opposite direction and asking for a 6-hour clock.

I do playing-card generation, and almost all models struggle beyond the "6 of X".

My working theory is that they were trained really hard to put 5 fingers on hands, but their counting drops off quickly.

That's because they literally cannot do that. Doing what you're asking requires an understanding of why the numbers on the clock face are where they are and what it would mean if there were an extra hour on the clock (i.e., that you would have to divide 360 by 13 to begin to understand where the numbers would go). AI models have no concept of anything that's not included in their training data. Yet people continue to anthropomorphize this technology and are surprised when it becomes obvious that it's not actually thinking.

  • The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.

    Put another way, it was hoped that once the dataset got rich enough, developing this understanding would actually be more efficient for the neural network than memorizing the training data.

    The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.

    Or more realistically, put in the creative and difficult research work required to discover the answer to that question.

  • It's interesting because if you asked them to write code to generate an SVG of a clock, they'd probably use a loop from 1 to 12, using sin and cos of the angle (the loop index over 12, times 2π) to place the numerals. They know how to do this, and so they basically understand the process that generates a clock face. And extrapolating from that to 13 hours is trivial (for a human). So the fact that they can't do this extrapolation on their own is very odd. (A sketch of such a loop is included after these replies.)

  • gpt-image-1 and Google Imagen understand prompts; they just don't have training data to cover these use cases.

    gpt-image-1 and Imagen are wickedly smart.

    The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

    • >> The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

      That's great, but I bet it can't tie its own shoes.

  • I wonder if you would have more success if you painstakingly described the shape and features of a clock in great detail but never used the words clock or time or anything that might give the AI the hint that they were supposed to output something like a clock.

    • And this is a problem for me. I guess that it would work, but as soon as the word "clock" appears, gone is the request because a clock HAS.12.HOURS.

      I use this a lot in cybersecurity when I need to do something "illegal". I am refused help until I say that I am doing research on cybersecurity. In that case, no problem.

  • The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.

    • I wonder if it's because we mean different things by generalization.

      For text, "generalization" is still "generate text that conforms to all the usual rules of the language". For images of 13-hour clock faces, we're explicitly asking the LLM to violate the inferred rules of the universe.

      I think a good analogy would be asking an LLM to write in English, except the word "the" now means "purple". They will struggle to adhere to this prompt in a conversation.

  • Yes, the problem is that these so-called "world models" do not actually contain a model of the world, or of any world.
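
As a follow-up to the SVG comment above: here is a hedged sketch (my own illustration, not any model's actual output) of the kind of loop described there, placing each numeral with sin/cos so that going from 12 to 13 hours is a one-constant change.

    # Illustrative sketch: emit an SVG clock face by looping over the hours
    # and placing each numeral with cos/sin of its angle.
    import math

    HOURS = 13          # set to 12 for a normal clock face
    CX, CY, R = 100, 100, 90

    parts = [f'<circle cx="{CX}" cy="{CY}" r="{R}" fill="none" stroke="black"/>']
    for i in range(1, HOURS + 1):
        angle = 2 * math.pi * i / HOURS - math.pi / 2   # hour HOURS lands at the top
        x = CX + 0.8 * R * math.cos(angle)
        y = CY + 0.8 * R * math.sin(angle)              # SVG y grows downward, so this runs clockwise
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
                     f'dominant-baseline="middle">{i}</text>')

    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
           f'{"".join(parts)}</svg>')
    with open("clock.svg", "w") as f:
        f.write(svg)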

Ah! This is so sad. The manager types won't be able to add an hour (actually, two) to the day even with AI.

I've been trying for the longest time and across models to generate pictures or cartoons of people with six fingers and now they won't do it. They always say they accomplished it, but the result always has 5 fingers. I hate being gaslit.

LLMs are terrible at out-of-distribution (OOD) tasks. You should use chain-of-thought suppression and give constraints explicitly.

My prompt to Grok:

---

Follow these rules exactly:

- There are 13 hours, labeled 1–13.

- There are 13 ticks.

- The center of each number is at angle: index * (360/13)

- Do not infer anything else.

- Do not apply knowledge of normal clocks.

Use the following variables:

HOUR_COUNT = 13

ANGLE_PER_HOUR = 360 / 13 // 27.692307°

Use index i ∈ [0..12] for hour marks:

angle_i = i * ANGLE_PER_HOUR

I want html/css (single file) of a 13-hour analog clock.

---

Output from Grok:

https://jsfiddle.net/y9zukcnx/1/
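
If anyone wants to sanity-check the tick placement in that fiddle, here is a tiny Python snippet (mine, not part of Grok's output) that just evaluates the formula from the prompt:

    # Evaluate the angles the prompt specifies: angle_i = i * (360 / 13).
    HOUR_COUNT = 13
    ANGLE_PER_HOUR = 360 / HOUR_COUNT  # ~27.692307 degrees

    for i in range(HOUR_COUNT):
        print(f"mark {i:2d}: {i * ANGLE_PER_HOUR:8.4f} deg")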