Comment by romaniv

5 days ago

> there’s zero chance any AI lab would train a model for such a ridiculous task.

A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT-4 report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that, it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.

100% marketing, 0% science.

[1] https://arxiv.org/pdf/2303.12712

For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT-4 paper did several SVG and TikZ tests, and the actual image is rather arbitrary. You wouldn't want to optimize for a single image, but if you're doing halfway decent training, a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].

[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...

[1] I'm sure there is one, given Simon's fame

My own informal test since generative AI came out has been "a picture of an old man riding a bicycle over a river". I just ran it in ChatGPT with the standard model I have (5.5). It shows the old man on an old bicycle, with the bicycle on a slack line extending over the river and a medieval village in the background.

The point is that the prompt has a subtle ambiguity: "how is the old man going over the river?" My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over the river, set in an area developed enough to have a bridge going over it.

So the implication I draw is that these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this), but they still fail to add the assumptions that people would draw.

So my conclusion is that LLMs are getting better and better at what they do, but there are going to be places where they fail to satisfy common human assumptions.

  • > but they still fail to add the assumptions that people would draw.

    I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw"; however, what do you want from this cognitive automation?

    Do you want "what most people would do", or do you want "something creative, an outlier, that still satisfies conditions"?

    • I would want to know the LLM has a reliable and realistic World Model underneath all of the next token prediction.

      Whether I am building hardened engineering systems, or discussing cooking methods, or discussing sensitive health concerns, or navigating complex psychological and interpersonal issues, the model will inevitably have to make some assumptions about context I haven’t provided. I want to know that those assumptions are grounded in reality.

      For what it’s worth, a slack-line over a river in front of a medieval town is too anachronistic to be interesting, let alone the idea of an old man riding a bicycle well enough over a slack-line. That is output that was not grounded in a solid world model, regardless of how “creative” it was.

    • Well, if Rene Magritte or some similar artist produces a man riding a bicycle over a tightrope, he's being creative because he knows what people expect from "a man riding a bicycle over a river". But I think the machine doesn't know the normal expectations, so it's not being creative, just failing. A splatter sheet of an industrial painting operation may look like a Jackson Pollock print. The hired painters might even notice this after their shift. But if the process that produces this is just painting tractors, it's not creative either.

    • I think the point is that language is compressed. There's a lot conveyed in very little. Yes, it is ambiguous, but that's exactly the feature that makes natural language useful. It's also why it is so much easier to speak with your friends than it is with some random person in your town: you've learned how to compress and decompress each other's language better.

      But that's also why we invented formal languages like math and programming: there are a lot of times when we don't want ambiguity. Law is basically mankind's greatest attempt at making natural language unambiguous, and it doesn't take a genius to realize that that's a shitshow and never going to happen. At the end of the day, making natural language even relatively low in ambiguity requires a metric fuck ton more words than it would take to express the same thing in a formal language (and formal languages are themselves overly pedantic and verbose).

      So the problem is that the AI doesn't share those expected decompression strategies. Sure, many humans won't either, but developing a shared language is essential for properly communicating with others. We've all worked with someone who feels like they're speaking a different language. It's exhausting, right?

    • Reminds me of that dad teaching his kids programming by preparing a PB sandwich [^0].

      Solvers are generally really good at bending your rules, but in a context where you want that. An outlaw rule-bending maniac is not what I want from a helpful agent.

      [^0]: https://www.youtube.com/watch?v=mrmqRoRDrFg