Comment by otterley
3 months ago
Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!
Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.
I’ve been using Kimi K2 a lot this month. It gives me Japanese->English translations at near-human quality, while respecting rules and context I give it in a very long, multi-page system prompt to improve fidelity for a given translation target (sometimes markup tags need to be preserved, sometimes deleted, etc.). It doesn’t require a thinking step to reach this level of translation quality, making it suitable for real-time translation. It doesn’t start getting confused when I feed it a couple dozen lines of previous translation context, like certain other LLMs do… instead, the translation actually improves with more context rather than degrading. It’s never refused a translation for “safety” purposes either (GPT and Gemini love to interrupt my novels and tell me certain behavior is illegal or immoral, and censor various anatomical words).
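For anyone curious, the setup is roughly the sketch below, assuming an OpenAI-compatible chat endpoint (the base URL, model name, and system prompt here are placeholders, not my real ones): keep a rolling window of prior source/translation pairs and prepend it to every request.

```python
# Sketch only: OpenAI-compatible client; endpoint, model name, and system prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")

SYSTEM_PROMPT = """You are a Japanese->English translator.
Preserve markup tags exactly as they appear in the source line.
Return only the translated line, nothing else."""

history: list[dict] = []  # rolling window of prior source/translation pairs

def translate(line: str, max_pairs: int = 24) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += history[-2 * max_pairs:]  # a couple dozen lines of prior context
    messages.append({"role": "user", "content": line})
    resp = client.chat.completions.create(model="kimi-k2", messages=messages)
    out = resp.choices[0].message.content
    history.append({"role": "user", "content": line})
    history.append({"role": "assistant", "content": out})
    return out
```

No thinking step and one completion per line, which is what keeps it fast enough for real-time use.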
> GPT and Gemini love to interrupt my novels and tell me certain behavior is illegal or immoral, and censor various anatomical words
Lol, are you using AI to create fan translations of erotic manga (エロ漫画)?
I have no idea what you're talking about… just kidding. Mainly visual novels and light novels, occasionally erotic stuff lol
I knew of Kimi K2 because it’s the model Kagi uses to generate AI answers when a query ends with a question mark.
It's also one of the few 'recommended' models in Kagi Assistant (multi-model ChatGPT basically, available on paid plans).
Really? They must've switched recently, because that was around before Kimi came out.
Yes, this is recent. Before that it was some other model(s), not sure which.
I find that Kimi K2 looks the best, but I've noticed the time is often wrong!
Qwen's clocks are highly entertaining. Like if you asked an alien "make me a clock".
It could be that the prompt is accidentally (or purposefully) more optimised for Kimi K2, or that Kimi K2 is better trained on this particular data. LLMs need "prompt engineers" for a reason: to get the most out of a particular model.
How much engineering do prompt engineers do? Is it engineering when you add "photorealistic. correct number of fingers and teeth. High quality." to the end of a prompt?
we should call them "prompt witch doctors" or maybe "prompt alchemists".
I write quite a lot of prompts, and the closest analogy that I can think of is a shaman trying to appease the spirits.
Sure, we are still closer to alchemy than materials science, but it's still early days. Consider this blog post that was on the front page today: https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompt.... The table at the bottom shows a generally steady increase in performance just from iterating on prompts. It feels like we are on the path to true engineering.
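To be concrete about what "iterating on prompts" looks like once you attach a metric to it, it's basically the loop below (a hypothetical sketch; the dataset and call_llm are stand-ins, not from the linked post): score each candidate prompt against the same labeled set and keep the winner.

```python
# Hypothetical sketch of metric-driven prompt iteration; `call_llm` and the
# dataset are stand-ins for whatever model and eval set you actually use.
def accuracy(prompt: str, dataset: list[tuple[str, str]], call_llm) -> float:
    hits = sum(call_llm(prompt.format(input=x)).strip() == y for x, y in dataset)
    return hits / len(dataset)

def best_prompt(candidates: list[str], dataset: list[tuple[str, str]], call_llm) -> str:
    return max(candidates, key=lambda p: accuracy(p, dataset, call_llm))
```

That measure-change-measure loop is the part that starts to feel like engineering rather than alchemy.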
we used to just call them "good at googling". I've never met a self-described prompt engineer who had anything close to an engineering education or experience. Seems like an extension of the 6-week boot camp == software engineer trend.
I like that actually, I've spent the last year probably 60:40 between post-training and prompt engineering/witch doctoring (the two go together more than most people realize)
Some of it is engineering-like, but I've also picked up a sixth sense when modifying prompts about what parts are affecting the behavior I want to modify for certain models, and that feels very witch doctory!
The more engineering-like part is essentially trying to RE a black box model's post-training, but that goes over some people's heads so I'm happy to help keep the "it's just voodoo and guessing" narrative going instead :)
> we should call them "prompt witch doctors" or maybe "prompt alchemists".
Oh absolutely not! Only in engineering are you allowed to be called an engineer for no apparent reason; do that in other white-collar professions and you'd be behind bars for fraudulent claims.
"...and do it really well or my grandmother will be killed by her kidnappers! And I'll give you a tip of 2 billion dollars!!! Hurry, they're coming!"
Well if it works consistently, I don't see any problem with that. If they have a clear theory of when to add "photorealistic" and when to add "correct number of wheels on the bus" to get the output they want, it's engineering. If they don't have a (falsifiable) theory, it's probably not engineering.
Of course, the service they really provide is for businesses to feel they "do AI", and whether or not they do real engineering is as relevant as if your favorite pornstars' boobs are real or not.
It could be bioengineering if you add that to a clock prompt and then connect it to a CRISPR process for outputting DNA.
Horrifying prospect, tbh
"How is engineering a real science? You just build the bridge so it doesn't fall down."
I think the selection of models is a bit off. Haiku instead of Sonnet for example. Kimi K2's capabilities are closer to Sonnet than to Haiku. GPT-5 might be in the non-reasoning mode, which routes to a smaller model.
I had my suspicions about the GPT-5 routing as well. When I first looked at it, the clock was by far the best; after the minute went by and everything refreshed, the next three were some of the worst of the group. I was wondering if it just hit a lucky path in routing the first time.
Goes to show the "frontier" is not really one frontier. It's a social/mathematical construct that's useful for a broad comparison, but if you have a niche task, there's no substitute for trying the different models.
Just use something like DSPy/Ax and optimize your module for any given LLM (based on sample data and metrics) and you’re mostly good. No need to manually wordsmith prompts.
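For anyone who hasn't tried it, a DSPy version of the clock task looks roughly like the sketch below (API details vary by DSPy version, and the model name, metric, and trainset here are placeholders). You declare the signature once and let an optimizer like BootstrapFewShot pick demonstrations against your metric, instead of hand-tuning the wording per model.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Placeholder model name; point this at whichever LLM you're optimizing for.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class DrawClock(dspy.Signature):
    """Generate responsive HTML/CSS for an analog clock showing the given time."""
    time: str = dspy.InputField()
    html: str = dspy.OutputField()

clock = dspy.Predict(DrawClock)

def clock_metric(example, pred, trace=None):
    # Naive placeholder metric: a real one would render pred.html and check
    # that the hands actually point at example.time.
    return "<" in pred.html and "clock" in pred.html.lower()

trainset = [dspy.Example(time=t).with_inputs("time") for t in ("3:45", "10:10", "7:05")]

optimized = BootstrapFewShot(metric=clock_metric).compile(clock, trainset=trainset)
print(optimized(time="4:20").html)
```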
It's not fair to use prompts tailored to a particular model when doing comparisons like this - one shot results that generalize across a domain demonstrate solid knowledge of the domain. You can use prompting and context hacking to get any particular model to behave pseudo-competently in almost any domain, even the tiny <1B models, for some set of questions. You could include an entire framework and model for rendering clocks and times that allowed all 9 models to perform fairly well.
This experiment, however, clearly states the goal with this prompt: `Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.`
An LLM should be able to interpret that, and should be able to perform a wide range of tasks in that same style - countdown timers, clocks, calendars, a floating quote bubble cycling through a list of 100 pithy quotations, etc. Individual, clearly defined elements should have complex representations in latent space that correspond to the human understanding of those elements. Tasks and operations and goals should likewise align with our understanding. Qwen 2.5 and some others clearly aren't modeling clocks very well, or maybe the HTML/CSS rendering latents are broken. If you pick a semantic axis (like analog clocks), you can run a suite of tests to demonstrate their understanding using limited one-shot interactions.
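For the clock axis specifically, the understanding being probed ultimately reduces to a couple of rotations the generated CSS has to encode (my own illustration, not part of this benchmark's harness):

```python
# Hand rotations (degrees clockwise from 12 o'clock) that a correct clock's
# CSS transform: rotate(...) rules need to encode for a given time.
def hand_angles(hour: int, minute: int, second: int = 0) -> dict[str, float]:
    return {
        "hour": (hour % 12) * 30 + minute * 0.5,  # 360/12 per hour, plus drift toward the next hour
        "minute": minute * 6 + second * 0.1,      # 360/60 per minute
        "second": second * 6,
    }

print(hand_angles(10, 10))  # {'hour': 305.0, 'minute': 60.0, 'second': 0}
```

Models that draw a plausible face but show the wrong time are arguably failing at this arithmetic (or at wiring it into the transforms), not at the CSS itself.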
Reasoning models can adapt on the fly, and are capable of cheating - one shots might have crappy representations for some contexts, but after a lot of repetition and refinement, as long as there's a stable, well represented proxy for quality somewhere in the semantics it understands, it can deconstruct a task to fundamentals and eventually reach high quality output.
These types of tests also allow us to identify mode collapses - you can use complex, sophisticated prompting to get most image models to produce accurate analog clocks displaying any time, but in simple one-shot tests the models tend to only be able to produce the time 10:10, and you'll get wild artifacts and distortions if you try to force any other configuration of hands.
Image models are so bad at hands that they couldn't even get clock hands right, until recently anyway. Nano banana and some other models are much better at avoiding mode collapses, and can traverse complex and sophisticated compositions smoothly. You want that same sort of semantic generalization in text generating models, so hopefully some of the techniques cross over to other modalities.
I keep hoping they'll be able to use SAEs (sparse autoencoders) or some other form of analysis on static weight distributions to uncover some structural signature of mode collapse, with a taxonomy of different failure modes and causes, like limited data, or corrupt/poisoned data, and so on. Seems like if you had that, you could deliberately iterate on a model, correct issues, or generate supporting training material to offset big distortions.
Qwen 2.5 is so bad it’s good. Some really insane results if you watch it for a while. Almost like it’s taking the piss.
It would be cool to also AI generate the favicon using some sort of image model.
Kimi K2 is legitimately good.
Perhaps Qwen 2.5 should be known as Dali 2.‽
When I clicked, everything was garbage except Grok and DeepSeek. Kimi was the worst clock.
> Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.
More like fell headfirst into the ground.
I'm disappointed with Gemini 2.5 (not sure if Pro or Flash) -- I've personally had _fantastic_ results with Gemini 2.5 Pro building PWAs, especially since the May 2025 "coding update." [0]
[0] https://blog.google/products/gemini/gemini-2-5-pro-updates/
I'm a huge K2 fan, it has a personality that feels very distinct from other models (not sycophantic at all), and is quite smart. Also pretty good at creative writing (tho not 100% slop free).
K2 hosted on Groq is pretty crazy for intelligence/second. (Low rate limits still, tho.)
My GPT-4o was 100% perfect on the first click. Since then, garbage. Gemini 2.5 was perfect on the 3rd click.
Right as you said that, I checked Kimi K2’s “clock” and it was just the ascii art: ¯\_(ツ)_/¯
I wonder if that is some type of fallback for errors querying the model, or if K2 actually created the HTML/CSS to display that.
I noticed the second hand is off tho. Gemini has the most accurate one.
Interestingly, either I'm _hallucinating_ this, or DeepSeek started to consistently show a clock without failures and with good time, where it previously didn't. ...aaand as I was typing this, it barfed a train wreck. Never mind, move along... No, wait, it's good again, no, wait...