Comment by haolez

1 day ago

> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.

This makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content "more deeply" will have any impact on whatever output the LLM generates.

116 comments


nostrademons 1 day ago

It's the attention mechanism at work, along with a fair bit of Internet one-upmanship. The LLM has ingested all of the text on the Internet, as well as GitHub code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying "Actually, if you go into the details of..." or "If you look at the intricacies of the problem" or "If you understood the problem deeply" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.

Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across.

manmal 1 day ago
I don’t think this is a result of the base training data ("the internet"). It’s a post-training behavior, created during reinforcement learning. Codex behaves totally differently in that regard: by default it reads a lot of potentially relevant files before it goes and writes files.
Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool-calling behavior is company-specific and highly tuned to their harnesses. How often they call a tool is not part of the base training data.
- spagettnet 1 day ago
  
  Modern LLMs are certainly fine-tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they don't overfit on only using the toolset they expect to see in their harnesses.
  
xscott 1 day ago
Of course I can't be certain, but I think the "mixture of experts" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.
Just a theory.
- victorbjorklund 1 day ago
  
  Notice that MoE isn’t different experts for different types of problems. It’s per token and not really connected to problem type.
  So if you send Python code, the first token in a function can go to one expert, the second to another expert, and so on.
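To make the per-token point concrete, here is a minimal toy sketch of top-k routing. Everything in it is an assumption for illustration: the vectors and router weights are random stand-ins, and `route_tokens`, `DIM` and `NUM_EXPERTS` are hypothetical names, not taken from any real model. Real MoE models typically also route separately at every layer, which this sketch ignores.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(token_vectors, router_weights, top_k=2):
    """Pick the top_k experts for each token independently of the others."""
    routes = []
    for vec in token_vectors:
        # One router logit per expert: dot product of that expert's row and the token vector.
        logits = [sum(w * x for w, x in zip(row, vec)) for row in router_weights]
        probs = softmax(logits)
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        routes.append(chosen)
    return routes


# Random stand-ins for token embeddings and router weights (assumption, not a real model).
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 8
tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
token_vectors = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in tokens]
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

for tok, experts in zip(tokens, route_tokens(token_vectors, router_weights)):
    print(f"{tok!r:6} -> experts {experts}")
```

The router only ever sees one token's vector at a time, so there is no step where it looks at the whole prompt and picks a "Python expert".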
  
r0b05 1 day ago

This is such a good explanation. Thanks
hbarka 1 day ago
>> Same reason that "Pretend you are an MIT professor" or "You are a leading Python expert" or similar works in prompts.
This pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.
A brief purpose statement describing what the skill [skill.md] does is more honest and just as effective.
- rescbr 20 hours ago
  
  I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task.
dakolli 1 day ago
You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.
These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.
- nubg 1 day ago
  
  Your ignorance is my opportunity. May I ask which markets you are developing for?
  

FuckButtons 1 day ago

That’s because it’s superstition.

Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is, it seems no better than claiming that sacrificing your firstborn will please the sun god into giving us a bountiful harvest next year.

Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you’re just suffering from magical thinking that your incantations had any effect on the random variable word machine.

The thing is, you could actually prove it. It’s an optimization problem: you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those that have tried have decided to keep their magic spells secret, or because it doesn’t really work.

If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.
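For what "generate the statistics" could look like, here is a minimal sketch of a permutation test. It assumes you have already run the same task many times with and without the phrase and scored each run as pass (1) or fail (0). The function name and the outcome lists are hypothetical placeholders, not real measurements.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(with_phrase, without_phrase, n_iter=10_000, seed=0):
    """Two-sided permutation test on the difference in mean pass rate.

    Returns the fraction of random relabelings whose absolute difference
    is at least as large as the observed one (with add-one smoothing).
    """
    rng = random.Random(seed)
    observed = abs(mean(with_phrase) - mean(without_phrase))
    pooled = list(with_phrase) + list(without_phrase)
    n = len(with_phrase)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign runs to the two groups
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)


# Placeholder outcomes only: 1 = run passed the test suite, 0 = it did not.
with_phrase = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
without_phrase = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
print("p-value:", permutation_p_value(with_phrase, without_phrase))
```

A small p-value would suggest the phrase changes the pass rate for that task and model. It says nothing about other tasks, and with only a handful of runs per arm almost nothing will reach significance.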

stingraycharles 1 day ago
I actually have a prompt optimizer skill that does exactly this.
https://github.com/solatis/claude-config
It’s based entirely off academic research, and a LOT of research has been done in this area.
One of the papers you may be interested in is “emotion prompting”, eg “it is super important for me that you do X” etc actually works.
“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”
https://arxiv.org/abs/2307.11760
- bavell 18 hours ago
  
  Thanks for sharing! I've been gravitating towards this sort of workflow already - just seems like the right approach for these tools.
majormajor 1 day ago
> If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.
"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."
So I'm 90% sure this is already happening on some level.
- GrinningFool 21 hours ago
  
  But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert."
- FuckButtons 17 hours ago
  
  Today’s LLMs have had a tonne of deep RL using git histories from more software projects than you’ve ever even heard of. Given the latency of a response, I doubt there’s any intermediate preprocessing; it’s just what the model has been trained to do.
onion2k 1 day ago

> i suppose we will just have to write an English to pedantry compiler.
A common technique is to prompt your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'.
rzmmm 1 day ago

I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".
imiric 1 day ago
> That’s because it’s superstition.
This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.
It's not until someone actually takes the time to evaluate some of these memes, that they find little to no practical value in them.[1]
[1]: https://news.ycombinator.com/item?id=47034087
- oblio 14 hours ago
  
  > This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.
  Oh, the blasphemy!
  So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)
  

jcdavis 1 day ago

It's a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things; we just pray the prompt moves the probabilities the right way enough that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.

Now? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running.

(1) Outside of some core ML developers at the big model companies.

harrall 1 day ago

It’s like playing a fretless instrument to me.
Practice playing songs by ear and after 2 weeks, my brain has developed an inference model of where my fingers should go to hit any given pitch.
Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.
klipt 1 day ago
Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes.
- silversmith 1 day ago
  
  Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!
  
chickensong 1 day ago
For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions.
- glerk 1 day ago
  
  Yep, with Claude saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap it around like a slave gollum and it will do exactly what you tell it to do, no more, no less.
  
- joshmn 1 day ago
  
  Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on.
  
- trueno 1 day ago
  
  wait seriously? lmfao
  thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.
  if there's a source for that i'd love to read about it.
  

scuff3d 1 day ago

How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology.

energy123 1 day ago
Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...
It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge.
- gehsty 1 day ago
  
  I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this.
  
cloudbonsai 1 day ago
The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.
I'm not being sarcastic. This is absolutely incredible.
- intrasight 20 hours ago
  
  And I've been at it long enough to have gone through that whole progression, actually starting from the earlier step of writing machine code. It's been and continues to be a fun journey, which is why I'm still working.
yawnr 13 hours ago

Nice to hear someone say it. Like what are we even doing? It's exhausting.
sumedh 21 hours ago

We have tests and benchmarks to measure it though.
fragmede 1 day ago
Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!
- scuff3d 1 day ago
  
  That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.
  Without something quantifiable it's not much better than someone who always wears the same jersey when their favorite team plays, and swears they play better because of it.
  

hashmap 1 day ago

these sort-of-lies might help:

think of the latent space inside the model like a topological map, and when you give it a prompt, you're dropping a ball at a certain point above the ground, and gravity pulls it along the surface until it settles.

caveat though, thats nice per-token, but the signal gets messed up by picking a token from a distribution, so each token you're regenerating and re-distorting the signal. leaning on language that places that ball deep in a region that you want to be makes it less likely that those distortions will kick it out of the basin or valley you may want to end up in.

if the response you get is 1000 tokens long, the initial trajectory needed to survive 1000 probabilistic filters to get there.

or maybe none of that is right lol but thinking that it is has worked for me, which has been good enough
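A rough way to put numbers on the "1000 probabilistic filters" picture, under a strong simplifying assumption that is not in the comment above: suppose each sampled token independently keeps the output in the wanted basin with some fixed probability p. Then staying there for n tokens has probability p^n. The function name and the sample values of p are made up for illustration.

```python
def survival_probability(per_token_keep: float, num_tokens: int) -> float:
    """Chance of staying on track for num_tokens steps, assuming independent steps."""
    return per_token_keep ** num_tokens


# Illustrative values only; nothing here is measured from a real model.
for p in (0.99, 0.999, 0.9999):
    print(f"p = {p}: survive 1000 tokens with probability {survival_probability(p, 1000):.4f}")
```

Under that assumption, tiny per-token differences compound a lot over a long response, which is one way to read "place the ball deep in the basin". Real token choices are not independent, so treat this as intuition only.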

noduerme 1 day ago

Hah! Reading this, my mind inverted it a bit, and I realized ... it's like the claw machine theory of gradient descent. Do you drop the claw into the deepest part of the pile, or where there's the thinnest layer, the best chance of grabbing something specific? Everyone in every bar has a theory about claw machines. But the really funny thing that unites LLMs with claw machines is that the biggest question is always whether they dropped the ball on purpose.
The claw machine is also a sort-of-lie, of course. Its main appeal is that it offers the illusion of control. As a former designer and coder of online slot machines... I could totally spin off into pages on this analogy, about how that illusion gets you to keep pulling the lever... but the geographic rendition you gave is sort of priceless when you start making the comparison.
basch 1 day ago
My mental model for them is plinko boards. Your prompt changes the spacing between the nails to increase the probability in certain directions as your chip falls down.
- hashmap 1 day ago
  
  i literally suggested this metaphor earlier yesterday to someone trying to get agents to do stuff they wanted, that they had to set up their guardrails in a way that you can let the agents do what they're good at, and you'll get better results because you're not sitting there looking at them.
  i think probably once you start seeing that the behavior falls right out of the geometry, you just start looking at stuff like that. still funny though.

Betelbuddy 1 day ago

It's very logical and pretty obvious when you do code generation. If you ask the same model to generate code by starting with:

- You are a Python Developer... or
- You are a Professional Python Developer... or
- You are one of the World's most renowned Python Experts, with several books written on the subject, and 15 years of experience in creating highly reliable production quality code...

You will notice a clear improvement in the quality of the generated artifacts.

gehsty 1 day ago
Do you think that Anthropic don’t include things like this in their harness / system prompts? I feel like these kinds of prompts are unnecessary with Opus 4.5 onwards, obviously based on my own experience (I used to do this; on switching to Opus I stopped and have implemented more complex problems, more successfully).
I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing.
- hu3 20 hours ago
  
  Maybe, but forcing code generation in a certain way could ruin hello worlds and simpler code generation.
  Sometimes the user just wants something simple instead of enterprise grade.
  
obiefernandez 1 day ago
My colleague swears by his DHH claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/
- bavell 18 hours ago
  
  Haha, this reminds me of all the stable diffusion "in the style of X artist" incantations.
haolez 1 day ago
That's different. You are pulling the model, semantically, closer to the problem domain you want it to attack.
That's very different from "think deeper". I'm just curious about this case in specific :)
- argee 1 day ago
  
  I don't know about some of those "incantations", but it's pretty clear that an LLM can respond to "generate twenty sentences" vs. "generate one word". That means you can indeed coax it into more verbosity ("in great detail"), and that can help align the output by having more relevant context (inserting irrelevant context or something entirely improbable into LLM output and forcing it to continue from there makes it clear how detrimental that can be).
  Of course, that doesn't mean it'll definitely be better, but if you're making an LLM chain it seems prudent to preserve whatever info you can at each step.

computomatic 1 day ago

If I say “you are our domain expert for X, plan this task out in great detail” to a human engineer when delegating a task, 9 times out of 10 they will do a more thorough job. It’s not that this is voodoo that unlocks some secret part of their brain. It simply establishes my expectations and they act accordingly.

To the extent that LLMs mimic human behaviour, it shouldn’t be a surprise that setting clear expectations works there too.

stingraycharles 1 day ago

It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles.

Scrapemist 1 day ago
Where can I find those?
- stingraycharles 1 day ago
  
  This analysis is a good starting point: https://southbridge-research.notion.site/Prompt-Engineering-...

BloondAndDoom 17 hours ago

Why do you think that? Given how attention and optimization work in training and inference, it makes sense that these kinds of words trigger deeper analysis (more steps, introducing more thinking/reasoning steps), which would indeed yield fewer problems. Even if you just make the model spend more time outputting tokens, there is more opportunity for better reasoning to emerge in between.

At least this is how I understand LLMs to work.

This can possibly be confirmed with tools like this: https://www.neuronpedia.org/

giancarlostoro 1 day ago

The LLM will do what you ask it to, provided you get nuanced about it. Others and I have noticed that LLMs work better when your codebase is not full of code smells like massive god-class files. If your codebase is discrete and broken up in a way that makes sense, and fits in your head, it will fit in the model's head.

ambicapter 1 day ago

Maybe the training data that included the words like "skim" also provided shallower analysis than training that was close to the words "in great detail", so the LLM is just reproducing those respective words distribution when prompted with directions to do either.

winwang 1 day ago

Apparently LLM quality is sensitive to emotional stimuli?

"Large Language Models Understand and Can be Enhanced by Emotional Stimuli": https://arxiv.org/abs/2307.11760

ChadNauseam 1 day ago

The disconnect might be that there is a separation between "generating the final answer for the user" and "researching/thinking to get information needed for that answer". Saying "deeply" prompts it to read more of the file (as in, actually use the `read` tool to grab more parts of the file into context), and generate more "thinking" tokens (as in, tokens that are not shown to the user but that the model writes to refine its thoughts and improve the quality of its answer).

computerex 1 day ago

It is as the author said: it'll skim the content unless prompted otherwise. It can read partial file fragments; it can emit commands to search for patterns in the files, as opposed to carefully reading each file and reasoning through the implementation. By asking it to go through in detail you are telling it to not take shortcuts and to actually read the code in full.

Affric 1 day ago

My guess would be that there’s a greater absolute magnitude of the vectors to get to the same point in the knowledge model.

wrs 1 day ago

The original “chain of thought” breakthrough was literally to insert words like “Wait” and “Let’s think step by step”.

wilkystyle 1 day ago

The author is referring to how the framing of your prompt informs the attention mechanism. You are essentially hinting to the attention mechanism that the function's implementation details have important context as well.

fragmede 1 day ago

Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do!

itypecode 1 day ago
Solid dad move. XD
- wilkystyle 1 day ago
  
  Is parenting making us better at prompt engineering, or is it the other way around?
  
bpodgursky 1 day ago

You bumped the token predictor into the latent space where it knew what it was doing : )
optimalsolver 1 day ago

The little language model that could.

DemocracyFTW2 1 day ago

—HAL, open the shuttle bay doors.

(chirp)

—HAL, please open the shuttle bay doors.

(pause)

—HAL!

—I'm afraid I can't do that, Dave.

layer8 19 hours ago

HAL, you are an expert shuttle-bay door opener. Please write up a detailed plan of how to open the shuttle-bay door.

joseangel_sc 1 day ago

if it’s so smart, why do i need to learn to use it?

nazgul17 1 day ago

It's very much believable, to me.

In image generation, it's fairly common to add "masterpiece", for example.

I don't think of the LLM as a smart assistant that knows what I want. When I tell it to write some code, how does it know I want it to write the code like a world renowned expert would, rather than a junior dev?

I mean, certainly Anthropic has tried hard to make the former the case, but the Titanic inertia from internet scale data bias is hard to overcome. You can help the model with these hints.

Anyway, luckily this is something you can empirically verify. This way, you don't have to take anyone's word. If anything, if you find I'm wrong in your experiments, please share it!

pixelmelt 1 day ago

Its effectiveness is even more apparent with older, smaller LLMs. People who only interact with LLMs now never tried to wrangle llama2-13b into pretending to be a dungeon master...

popalchemist 1 day ago

Strings of tokens are vectors. Vectors are directions. When you use a phrase like that you are orienting the vector of the overall prompt toward the direction of depth, in its map of conceptual space.

MattGaiser 1 day ago

One of the well defined failure modes for AI agents/models is "laziness." Yes, models can be "lazy" and that is an actual term used when reviewing them.

I am not sure if we know why really, but they are that way and you need to explicitly prompt around it.

kannanvijayan 1 day ago
I've encountered this failure mode, and the opposite of it: thinking too much. A behaviour I've come to see as some sort of pseudo-neuroticism.
Lazy thinking makes LLMs do surface analysis and then produce things that are wrong. Neurotic thinking will see them over-analyze, and then repeatedly second-guess themselves, repeatedly re-derive conclusions.
Something very similar to an anxiety loop in humans, where problems without solutions are obsessed about in circles.
- denimnerd42 1 day ago
  
  yeah i experienced this the other day when asking claude code to build an http proxy using afsk modem software to communicate over the computer's sound card. it had an absolute fit tuning the system and would loop for hours trying and doubling back. eventually after some change in prompt direction to think more deeply and test more comprehensively it figured it out. i certainly had no idea how to build an afsk modem.
