Comment by irthomasthomas
2 days ago
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"
Here's a classic ASCII art representation of SpongeBob SquarePants:
.------.
/ o o \
| |
| \___/ |
\_______/
llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
"An ASCII of spongebob"
Here's an ASCII art of SpongeBob SquarePants:
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..--.' /
|(_.' | / .-\-. \---.
\ 0| | ( O| O) | |
| _ | .--.____.'._.-. |
\ (_) | o -` .-` |
| \ |`-._ _ _ _ _\ /
\ | | `. |_||_| |
| o | \_ \ | -. .-.
|.-. \ `--..-' O | `.`-' .'
_.' .' | `-.-' /-.__ ' .-'
.' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
`-._ `. |________/\_____| `-.'
.' ).| '=' '='\/ '=' |
`._.` '---------------'
//___\ //___\
|| ||
||_.-. ||_.-.
(_.--__) (_.--__)
Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end, leaving off the legs. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?
Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.
memorized: https://www.asciiart.eu/cartoons/spongebob-squarepants
Naturally. That's how LLMs work. During training you measure the loss (the difference between the model's output and the ground truth) and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job at learning to draw Bob, while the small model performs poorly.
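To make that concrete, here's a toy Python sketch (not anything from Qwen's actual training stack; the tiny vocabulary is made up): next-token cross-entropy is minimized exactly when the model puts all its probability mass on the ground-truth continuation, so a sequence that recurs often in the corpus, like a popular ASCII SpongeBob, gets pushed toward verbatim recall.

    import math

    # Hypothetical tiny vocabulary and a single ground-truth next character
    vocab = ["|", "_", "(", "o", " "]
    ground_truth = "o"

    def loss(p_correct: float) -> float:
        """Cross-entropy assigned to the ground-truth token."""
        return -math.log(p_correct)

    for p in (0.2, 0.5, 0.9, 0.999):
        print(f"P({ground_truth!r}) = {p:<5} -> loss = {loss(p):.4f}")
    # The loss only approaches 0 as the model commits to reproducing the
    # training sequence exactly, which is what verbatim memorization is.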
We don't value LLMs for rote memorization though. Perfect memorization is a long solved task. We value LLMs for their generalization capabilities.
A scuffed but fully original ASCII SpongeBob is usually more valuable than a perfect recall of an existing one.
One major issue with highly sparse MoE is that it appears to advance memorization more than it advances generalization. Which might be what we're seeing here.
I'd argue that, actually, the smaller model is doing a better job at "learning", in that it includes key characteristics of the subject in its ASCII image even though the result is poor.
The larger model already has it in the training corpus, so it's not a particularly good measure. I'd much rather see a model's capabilities in trying to represent in ASCII something it's unlikely to have in its training data.
Maybe a pelican riding a bike as ASCII for both?
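If anyone wants to run that comparison, here's a rough sketch using the llm package's Python API (assuming the model IDs quoted in this thread are available through your installed plugins and API keys; the prompt is just the suggested pelican test):

    import llm

    # Model IDs taken from this thread; availability depends on your llm plugins
    MODELS = [
        "qwen3-next-80b-a3b-thinking",
        "chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",
    ]
    PROMPT = "An ASCII of a pelican riding a bicycle"  # unlikely to be memorized

    for model_id in MODELS:
        response = llm.get_model(model_id).prompt(PROMPT)
        print(f"=== {model_id} ===")
        print(response.text())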
> That's how LLMs work
And that is also exactly how we want them not to work: we want them to be able to solve new problems. (Because Pandora's box is open, and they are not sold as a flexible query machine.)
"Where was Napoleon born": easy. "How to resolve the conflict effectively": hard. Solved problems are interesting to students. Professionals have to deal with non trivial ones.
Not really.
Typically less than 1% of training data is memorized.
For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?
Conveniently removed the artist's signature though.
Yes - they all do that. Actually, most attempts start well but unravel toward the end.
Certainly not defending LLMs here, don't mistake this for that.
Humans do it too. I have given up on my country's non-local information sources, because I could recognize the original sources that were being deliberately omitted. There's a satirical webpage that is basically a Reddit scrape. Most users don't notice, and those who do don't seem to care.
I think there is some distillation relationship between Kimi K2 and Qwen Coder or other related models, or they share training data. I tried most LLMs, and only Kimi K2 gave the exact same ASCII. Kimi K2: Here's a classic ASCII art of SpongeBob SquarePants for you:
Enjoy your SpongeBob ASCII!
For ASCII art to look right and not get messed up, the generator has to know the width of the div in ASCII characters, e.g. 80, 240, etc., so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.
They don't see runs of spaces very well, so most of them are terrible at ASCII art. (They'll often regurgitate something from their training data rather than try themselves.)
And unless their terminal details are included in the context, they'll just have to guess.
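A quick way to see the "runs of spaces" problem is to push a few art lines through a BPE tokenizer. The sketch below uses tiktoken's cl100k_base purely as an illustration (Qwen ships its own tokenizer, but the effect is similar): whole runs of spaces collapse into far fewer tokens than characters, so column alignment is never directly visible to the model.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer for illustration
    for line in ["|  o  o  |", "|        |", "      \\___/"]:
        tokens = enc.encode(line)
        print(f"{line!r}: {len(line)} chars -> {len(tokens)} tokens")
    # Runs of spaces merge into single tokens, so the model has to infer
    # column positions rather than see them character by character.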
I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.
With Qwen3's sparse MoE, though, the path to that memory is noisier: two extra stochastic draws, (a) which expert(s) fire and (b) which token gets sampled from them. Add the new gated attention and multi-token prediction heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.
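For intuition, here is a minimal top-k routing sketch in PyTorch (not Qwen3's actual implementation; the dimensions and expert count are made up). The router's top-k choice is where the "which experts fire" sensitivity lives: a small perturbation to the router logits can swap which experts handle a token, and ordinary sampling then adds the second draw on top.

    import torch

    def moe_layer(x, router, experts, k=2):
        # Route each token to its top-k experts and mix their outputs
        probs = router(x).softmax(-1)                 # [tokens, n_experts]
        weights, idx = torch.topk(probs, k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)
        outputs = []
        for t in range(x.shape[0]):
            mixed = sum(
                weights[t, s] * experts[int(idx[t, s])](x[t])  # a "routing flake" = wrong pick here
                for s in range(k)
            )
            outputs.append(mixed)
        return torch.stack(outputs)

    torch.manual_seed(0)
    d, n_experts, tokens = 16, 8, 4
    router = torch.nn.Linear(d, n_experts)
    experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
    print(moe_layer(torch.randn(tokens, d), router, experts).shape)  # torch.Size([4, 16])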
Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:
Model: chutes/Qwen/Qwen3-235B-A22B
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507