
Comment by roywiggins

1 day ago

My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more often than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, and getting it wrong. Maybe Gemini 3 will pull it off.

Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. This is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is - something a five-year-old could probably learn to do - but can vibe code something using traditional computer vision techniques to do it.

I suppose it's not too surprising: a visually impaired programmer might find it impossible to do reliably themselves but could code up a solution. Still: it's weird!

Most models don’t have good spatial information from the images. Gemini models do preprocessing and so are typically better for that. It depends a lot on how things get segmented though.

But these models are more like generalists, no? Couldn’t they simply be hooked up to more specialized models and just defer to them, the way coding agents now use tools to assist?

  • There would be no point in going via an LLM then, if I had a specialist model ready I'd just invoke it on the images directly. I don't particularly need or want a chatbot for this.

That's a fairly unfair comparison. Did you include in the prompt a basic set of instructions about which way is "correct" and what to look for?

  • I didn't give a detailed explanation to the model, but I should have been clearer: they all seemed to know what to look for; they wrote explanations of what they were looking for, which were generally correct enough. They still got the answer wrong, hallucinating the locations of the anatomical features they insisted they were looking at.

    It's something that you can solve by just treating the brain as roughly egg-shaped and working out which way the pointy end is, or by looking for the very obvious bilateral symmetry - something like the sketch below. You don't really have to know what any of the anatomy actually is.
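
    (For illustration only, a minimal NumPy sketch of the "roughly egg-shaped, find the long axis" idea - not the solution Claude actually produced; the function name and the mean-based threshold are just assumptions:)

    ```python
    import numpy as np

    def long_axis_orientation(slice_2d):
        """Guess whether an axial slice's long axis runs vertically or horizontally.

        Sketch only: threshold the slice into a crude brain mask, then take the
        principal axis of the mask pixels' covariance as the "egg" direction.
        """
        mask = slice_2d > slice_2d.mean()        # crude foreground/background split
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            raise ValueError("empty mask - adjust the threshold")

        cov = np.cov(np.vstack([xs, ys]))        # 2x2 covariance of pixel coordinates
        eigvals, eigvecs = np.linalg.eigh(cov)
        major = eigvecs[:, np.argmax(eigvals)]   # direction of greatest spread

        # If the long axis leans more along y than x, the "egg" points vertically.
        return "vertical" if abs(major[1]) > abs(major[0]) else "horizontal"
    ```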

This might be showing bugs in the training data. It is common to augment image data sets with mirroring, which is cheap and fast.
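
(For what it's worth, the kind of augmentation step being described, as a hypothetical sketch - not taken from any particular training pipeline:)

```python
import numpy as np

# Common, cheap augmentation: randomly mirror training images left-right.
# If orientation-labelled scans pass through a step like this, the
# "which side is which" signal is effectively erased during training.
def augment(image, rng=np.random.default_rng()):
    if rng.random() < 0.5:
        image = np.fliplr(image)   # horizontal mirror
    return image
```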

And then, in a different industry, one with physical factories, there's this obsession with the idea that getting really good at making the machine that makes the machine (the product) is the route to success. So it's funny that LLMs being able to write programs to do the thing you want is seen as a failure here.

What is the “normal” way? Is that defined in a technical specification? Did you provide the definition/description of what you mean by “normal”?

I would not have expected a language model to perform well on what sounds like a computer vision problem? Even if it were agentic: just as you imply a five-year-old could learn how to do it, an AI system would also need to be trained, or at the very least be provided with a description of what it is looking at.

Imagine you took an MRI brain scan back in time and showed it to a medical doctor in the 1950s, or even 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?

I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.

I would suggest you think of it as a kind of child: with that framing, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.

  • The models absolutely do know what the standard orientation is for a scan. They respond extensively about what they're looking for and what the correct orientation would be, more or less accurately. They are aware.

    They then give the wrong answer, hallucinating anatomical details in the wrong place, etc. I didn't bother with extensive prompting because the models don't evince any confusion about the criteria; they just don't seem to understand spatial orientation very well, so more prompting seemed unlikely to help.

    The thing is that it's very, very simple: an axial slice of a brain is basically egg-shaped. You can work out whether it's pointing vertically (i.e., nose pointing towards the top of the image) or horizontally just by looking at it. LLMs will insist it's pointing vertically when it isn't. It's an easy task for someone with eyes.

    Essentially all images of brains an LLM will have seen will be in this orientation, which is either a help or a hindrance, and I think in this case a hindrance: it's not that it's seen lots of brains and doesn't know which are correct, it's that it has only ever seen them in the standard orientation and it can't see the trees for the forest, so to speak.
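
    (Again purely illustrative, a sketch of the bilateral-symmetry check rather than anything a model produced; the normalized-correlation "agreement" measure is an arbitrary choice:)

    ```python
    import numpy as np

    def midline_orientation(slice_2d):
        """Guess whether the brain's midline runs vertically or horizontally.

        Sketch of the bilateral-symmetry idea: a nose-up (or nose-down) axial
        slice should look nearly the same mirrored left-right, and noticeably
        different mirrored top-bottom.
        """
        img = slice_2d.astype(float)
        img -= img.mean()                        # zero-mean for the correlation below

        def agreement(a, b):
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            return (a * b).sum() / denom if denom else 0.0

        lr = agreement(img, np.fliplr(img))      # mirror across the vertical midline
        tb = agreement(img, np.flipud(img))      # mirror across the horizontal midline

        return "nose up or down" if lr > tb else "nose left or right"
    ```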