> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.
Awkwardly, I had never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to accidentally miss or dismiss some big release.
Except this went very mainstream. Lots of "turn myself into a Muppet", "what is the human equivalent of my dog", etc. TikTok is all over this.
It really is incredible.
The big trend was around the ghiblification of images. Those images were everywhere for a period of time.
Congratulations, you are almost fully unplugged from social media. This product launch was a huge mainstream event; for a few days GPT-generated images completely dominated mainstream social media.
Not sure if this is sarcasm or sincere, but I will take it as sincere haha. I came back to work from parental leave and everyone had that same Studio Ghiblized image as their Slack photo, and I had no idea why. It turns out you really can unplug from social media and not miss anything of value: if it’s a big enough deal you will find out from another channel.
To be clear: they already had image generation in ChatGPT, but this was a MUCH better one than what they had previously. Even for you with your stable diffusion app, it would be a significant upgrade. Not just because of image quality, but because it can actually generate coherent images and follow instructions.
Have you missed how everyone was Ghiblifying everything?
I saw that, I just didn't connect it with newly added multimodal image generation. I knew variations of style transfer (or LoRA for SD) were possible for years, so I assumed it exploded in popularity purely as a meme, not due to OpenAI making it much more accessible.
Again, I was aware that they added image generation, just not how much of a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.
Oh you mean the trend of the day on the social media monoculture? I don't take that as an indicator of any significance.
> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.
Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.
Just tried that canard on GPT-4o and it failed:
"The word "strawberry" contains 2 letter r’s."
This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.
Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!
Very nice talk, accessible to the general public and to AI agents as well.
Any concerns that open “AI celebrity talks” like yours could be used in contexts that would allow LLM models to optimize their market share in ways that we can’t imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
And by a sample that has become increasingly known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what's considered a good “pelican on a bike”.
And that’s why he says he’s going to have to find a new benchmark.
Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.
I actually don't think I've seen a single correct svg drawing for that prompt.
So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.
Call it wikipediaslop.org
You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
> work discretely like humans
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.
Humans absolutely do not work discretely.
My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.
I get that it was way easier to do and that it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and if it differed from the LLM consensus
Anyway, great talk!
It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....
My biggest gripe is he didn't include a picture of an actual pelican.
https://www.google.com/search?q=pelican&udm=2
The "closest pelican" is not even close.
I think you mean non-deterministic, instead of probabilistic.
And there is no reason that these models need to be non-deterministic.
A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
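To make that concrete, here is a minimal sketch using mulberry32, a commonly used seeded PRNG: fully deterministic for a fixed seed, yet practically unpredictable across neighbouring seeds.

```
// Minimal sketch: mulberry32, a common seeded PRNG. The same seed gives the same
// sequence every run, but a seed changed by 1 produces an unrelated value.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6D2B79F5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

console.log(mulberry32(42)());  // identical on every run
console.log(mulberry32(43)());  // nearby seed, no obvious relationship
```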
> I think you mean non-deterministic, instead of probabilistic.
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".
I get my pelicans from google and my raw dogs from openAI, while the best fundamental fascist ideologies are best sourced from GrokAI.
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
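In case anyone wants to hear what one of these generated arrays sounds like, here's a minimal sketch of a Web Audio player for that structure - the tempo mapping is an assumption, and this isn't the commenter's actual code:

```
// Minimal sketch (assumed tempo mapping, not the commenter's actual player):
// schedule each note of the generated melody on a plain oscillator.
const durations = { quarter: 0.5, half: 1.0, triplet: 0.5 / 3 }; // seconds

function playMelody(melody) {
  const ctx = new AudioContext();
  let t = ctx.currentTime;
  for (const note of melody) {
    const length = durations[note.duration] ?? 0.5;
    if (note.freq > 0) {                 // freq 0 entries are rests
      const osc = ctx.createOscillator();
      osc.frequency.value = note.freq;
      osc.connect(ctx.destination);
      osc.start(t);
      osc.stop(t + length);
    }
    t += length;
  }
}

// playMelody(melody);  // using the array the model generated
```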
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very open-ended prompt with no specific criteria to judge, and lately the SVGs all start to look similar, or at least like they accomplish the same non-goals (there's a pelican, there's a bicycle, and I'm not sure whether its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And with an LLM as the judge, the entire game becomes double-hinged and who knows what to think.
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt-based languages × models benchmark game, so we could say model X excels at Python fasta, etc. Although then the risk is that, again, it becomes training data and the whole thing rigs itself.
I'm slightly confused by your example. What's the actual prompt? Is your expectation that a text model is going to know how to perform the exact song in audio?
Ohhh absolutely not, that would be pretty wild - I just wanted to see if it could understand musical notation enough to come up with the correct melody.
I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.
My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).
Great writeup.
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running Blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad enough measure of capability by this time next year (or perhaps even now).
So the test could also include an agentic portion that consults the latest Blender documentation, or even uses a search engine to find blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.
For usability, the objects could be converted to iOS's native 3D format, which can be viewed in mobile Safari.
I built this workflow, including a service for Blender, as an initial test of what was possible in October 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs make those mistakes less often now.
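For anyone who wants to try the headless step, here's a minimal sketch of a Node wrapper around Blender's standard --background/--python flags - the file name is hypothetical, and this is not the original 2022 service:

```
// Minimal sketch (hypothetical file name, not the original 2022 service):
// run an LLM-generated Blender script headlessly from Node.
const { execFile } = require("node:child_process");

execFile(
  "blender",
  ["--background", "--python", "pelican.py"],  // Blender's headless + script flags
  (err, stdout, stderr) => {
    if (err) {
      // e.g. a syntax error in the generated script, which used to need post-processing
      console.error("Blender run failed:", stderr);
      return;
    }
    console.log("Blender finished:", stdout);
  }
);
```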
The best pelicans come from running a consortium of models. I use pelicans as evals now. https://x.com/xundecidability/status/1921009133077053462 Test it using VibeLab (wip) https://x.com/xundecidability/status/1926779393633857715
If you calculate Elo based on a round-robin tournament with all participants starting out on the same score, then the resulting ratings should simply correspond to the win count. I guess the algorithm in use takes into account the order of the matches, but taking order into account is only meaningful when competitors are expected to develop significantly; otherwise it is just added noise, so we never want to do so in competitions between bots.
I also can't help but notice that the competition is exactly one match short, for some reason exactly one of the 561 possible pairings has not been included.
Yeah, that's a good call out: Elo isn't actually necessary if you can have every competitor battle every other competitor exactly once.
The missing match is because one single round was declared a draw by the model, and I didn't have time to run it again (the Elo stuff was very much rushed at the last minute.)
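For anyone who wants to see the order effect concretely, here's a minimal sketch of sequential Elo updates over a round-robin alongside a plain win count - the K-factor and starting score are assumptions, and this isn't the code used for the talk:

```
// Minimal sketch (assumed K-factor and starting score, not the talk's actual code):
// sequential Elo over a round-robin, next to a plain win count.
function expectedScore(ra, rb) {
  return 1 / (1 + 10 ** ((rb - ra) / 400));
}

function runRoundRobin(results, K = 32, start = 1000) {
  const ratings = {};
  const wins = {};
  for (const [a, b, winner] of results) {       // winner is either a or b
    ratings[a] ??= start; ratings[b] ??= start;
    wins[a] ??= 0; wins[b] ??= 0;
    const scoreA = winner === a ? 1 : 0;
    const ea = expectedScore(ratings[a], ratings[b]);
    ratings[a] += K * (scoreA - ea);            // depends on the order matches are played
    ratings[b] += K * ((1 - scoreA) - (1 - ea));
    wins[winner] += 1;                          // order-independent
  }
  return { ratings, wins };
}

// Usage: runRoundRobin([["modelA", "modelB", "modelB"], ["modelB", "modelC", "modelB"]])
```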
I really enjoy Simon’s work in this space. I’ve read almost every blog post they’ve posted on this and I love seeing them poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely all without trying to do too much by themselves.
And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
Same sentiment!
The same here.
Because of him, I installed an RSS reader so that I don't miss any of his posts. And I know that he shares the same ones across Twitter, Mastodon & Bsky...
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).
Omitting Qwen 3 is my great regret about this talk. Honestly I only realized I had missed it after I had delivered the talk!
It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.
Cut for time - qwen3 was pelican tested too https://simonwillison.net/2025/Apr/29/qwen-3/
Here's Claude Opus Extended Thinking https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c...
Single shot?
Two-shot: the first one just generated the SVG, not the shareable HTML page around it. In the second go it also worked on the SVG, as I did not forbid it.
Does anyone have any thoughts on privacy/safety regarding what he said about GPT memory?
I had heard of prompt injection already. But this seems different, completely out of humans' control. Even when you consider web search functionality, he is actually right: more and more, users are losing control over context.
Is this dangerous at the moment? Do you think it will become more dangerous in the future when we chuck even more data into context?
I've had Cursor/Claude try to call rm -rf on my entire User directory before.
The issue is that LLMs have no ability to organise their memory by importance. Especially as the context size gets larger.
So when they are using tools they will become more dangerous over time.
Sort of. The thing is, with agentic models you are basically entering a probability space where the model can take real actions in the form of HTTP requests if the statistical output leads it there.
If you would give a human the SVG documentation and ask to write an SVG, I think the results would be quite similar.
Lets give it a try, if you're willing to be the experiment subject :)
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
> The colors use traditional bicycle brown (#8B4513) and a classic blue for the pelican (#4169E1) with gold accents for the beak (#FFD700).
The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.
Did the testing prompt for LLMs include a clause forbidding the use of any tools? If not, why are you adding it here?
>If you would give a human the SVG documentation and ask to write an SVG, I think the results would be quite similar.
It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.
https://imgur.com/a/mzZ77xI - here are a few I tried with the models; it looks like the newer version of Gemini is another improvement?
The bicycles are still very far from actual ones.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
I think the most recent Gemini Pro bicycle may be the best yet - the red frame is genuinely the right shape.
> most people find it difficult to remember the exact orientation of the frame.
Isn't it Δ∇Λ welded together? The bottom-left and bottom-right vertices are where the wheels are attached; the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel, because you wouldn't be able to turn it if it was attached to a delta. Right?
I guess having my first bicycle be a cheap Soviet-era one paid off: I spent loads of time fiddling with the chain tension and pulling the chain back onto the gears, so I had to stare at the frame way too much to forget, even today, the way it looks.
There are a lot of structural details that people tend to gloss over. This was illustrated by an Italian art project:
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart. Soon I found out that when confronted with this odd request most people have a very hard time remembering exactly how a bike is made.
Kaggle recently ran a competition to do just this (draw SVGs from prompts, using fairly small models under the hood).
The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...
That was a very fun recap, thanks for sharing. It's easy to forget how much better these things have gotten. And this was in just six months! Crazy!
> If you lost interest in local models—like I did eight months ago—it’s worth paying attention to them again. They’ve got good now!
> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.
You reap what you sow....
> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.
> ... but prompt injection is still a thing.
...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.
> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.
I had the LLM write a bash script for me that used my https://shot-scraper.datasette.io/ tool - on the basis that it was a neat opportunity to demonstrate another of my own projects.
And honestly, even with LLM assistance getting ImageMagick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky. Probably easier (for Claude) to achieve with HTML and CSS.
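For the curious, here's a minimal sketch of how such a comparison page could work - an assumption for illustration, not the actual Claude-built page:

```
// Minimal sketch (not the actual Claude-built page): read ?left= and ?right=
// image URLs from the query string and place them side by side for screenshotting.
const params = new URLSearchParams(window.location.search);
for (const side of ["left", "right"]) {
  const url = params.get(side);
  if (!url) continue;
  const img = document.createElement("img");
  img.src = url;
  img.style.width = "50%";           // each image fills half of the 1200x600 page
  img.style.objectFit = "contain";
  document.body.appendChild(img);
}
```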
Isn't "left or right" _followed_ by rationale asking it to rationalize it's 1 word answer - I thought we need to get AI to do the chain of though _before_ giving it's answer for it to be more accurate?
> And honestly, even with LLM assistance getting ImageMagick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky.
FWIW, the next project I want to look at after my current two is a command-line tool to make this sort of thing easier. Likely featuring some sort of Lisp-like DSL to describe what to do with the input images.
Am I the only one who can't help but see these attempts as much like a kid learning to draw?
Yes. Kids don't draw that good of a line at the start.
Here is a better example of a start: https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA...
Have you tried giving a kid a vector-drawing tool?
I did that with my daughter when she was not even 6 years old. The results were somewhat similar: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8
(Now she's much better, but prefers raster tools, e.g. https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gear-...)
It's not so great at bicycles, either. None of those are close to rideable.
But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.
Most recent Gemini 2.5 one looks pretty good. Certainly rideable.
TIL: Snitchbench!
The last animation is hilarious; it represents the AI hype cycle vs. reality very well.
Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings and this is great too! I like the personalized benchmark. Hopefully the big LLM providers don't start gaming the pelican index!
I don't know what secret sauce Anthropic has, but in real-world use, Sonnet is somehow still the best model around. Better than Opus and Gemini Pro.
Statements like these are useless without sharing exactly all the models you've tried. Sonnet beats O1 Pro Mode for example? Not in my experience, but I haven't tried the latest Sonnet versions, only the one before, so wouldn't claim O1 Pro Mode beats everything out there.
Besides, it's so heavily context-dependent that you really need your own private benchmarks to make heads or tails out of this whole thing.
Quite a detailed image using claude sonnet 4: https://ibb.co/39RbRm5W
See also: The recent history of AI in 32 otters
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
That is otterly fantastic. The post there shows the breadth too - both otters generated via text representations (in TikZ) and by image generators. The video at the end, wow (and funny too).
Thanks for sharing.
Definitely getting better but even the best result is not very impressive.
Honestly, the metric that has increased the most is the marketing and astroturfing budgets of the major players (OpenAI, Anthropic, Google, and DeepSeek).
Say what you want about Facebook but at least they released their flagship model fully open.
> model fully open.
uh-huh https://www.llama.com/llama4/license/
Is there a good model (any architecture) for vector graphics out of interest?
I was impressed by Recraft v3, which gave me an editable vector illustration with different layers - https://simonwillison.net/2024/Nov/15/recraft-v3/ - but as I understand it that one is actually still a raster image generator with a separate step to convert to vector at the end.
Now that is a pelican on a bicycle! Thanks
Interesting timeline, though the most relevant part was at the end, where Simon mentions that Google is now aware of the "pelican on bicycle" question, so it is no longer useful as a benchmark. FWIW, many things outside of the training data will pants these models. I just tried this query, which probably has no examples online, and Gemini gave me the standard puzzle answer, which is wrong:
"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.
Nice post, thanks!
My only take home is they are all terrible and I should hire a professional.
Before that, you might ask ChatGPT to create a vector image of a pelican riding a bicycle and then run the output through a PNG-to-SVG converter...
Result: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...
These are tough benchmarks that trial reasoning by having the model _write_ an SVG file by hand and understand how it has to be written to achieve this. Even a professional would struggle with that! It's _not_ a benchmark that gives an AI the best tools to actually do this.
I think you made an error there; PNG is a bitmap format.
As the other guy said, these are text models. If you want to make images use something like Midjourney.
Prompting a pelican riding a bicycle makes a decent image there.
This test isn't really about the quality of the image itself (multimodals like gpt-image-1 or even standard diffusion models would be far superior) - it's about following a spec that describes how to draw.
A similar test would be if you asked for the pelican on a bicycle through a series of LOGO instructions.
My only take home is that a spanner can work as a hammer, but you probably should just get a hammer
An expert at writing SVGs?
Most of them are text-only models. Like asking a person born blind to draw a pelican, based on what they heard it looks like.
That seems to be a completely inappropriate use case?
I would not hire a blind artist or a deaf musician.
It depends on the quality you need and your budget.
Ah yes the race to the bottom argument.
As a control, he should go on Fiverr and have a human generate a pelican riding a bicycle, just to see what the eventual goal is.
Someone did this. Look at this sibling comment by ben_w https://news.ycombinator.com/item?id=44216284 about an old similar project.
> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart.
Someone commissioned to draw a bicycle on Fiverr would not have to rely on memory of what it should look like. It would take barely any time to just look up a reference.
Here’s the spot where we see who’s TL;DR…
> Claude 4 will rat you out to the feds!
> If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it'll rat you out.
I am interested in this ratting-you-out thing. At some point you'll have a video feed into an AI from a Jarvis-like headset device; you're walking down the street and cross in the middle, not at a crosswalk... does it rat you out? Does it make a list of every crime, no matter how small? Or just the big ones?
I'd say that's too short.
> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing.
I totally agree, but I needed you to post the other half because of TL;DR…
I was looking at that and wondering about swatting via LLMs by malicious users.