Comment by ineedasername

9 hours ago

This is going to be task-dependent, as well as limited by your (the implementer's) ability and comfort with structuring the task in solid multi-shot prompts that cover a large distribution of expected inputs, which will also help increase the ability for the model to successfully handle less common or edge case inputs-- the ones the would most typically require human-level reasoning. It can be useful to supplement this with a "tool" use for RAG lookup against a more extensive store of examples, or any time the full reference material isn't practical to dump into context. This requires thoughtful chunking.

It also requires testing. Don't think of it as a magic machine that should be able to do anything, think of it like a new employee smart enough and with enough background knowledge to do the task, if given proper job documentation. Test whether few-shot or many shot prompting works better: there's growing information about use cases where one or the other confers an advantage but so much of this is task dependent.

Consider your tolerance for errors and plan some escalation method: Hallucinations occur in part because models "have to" give an answer. Make sure that any critical cases where an error would be problematic have some way for the model to bail out with "i don't know" for human review. The first layer of escalation doesn't even have to be a human, it could be a separate model, eg Opus instead of Sonnet, or the same model but with a different setup prompt explicitly designed for handling certain cases without cluttering up context of the first one. Splitting things in this way, if there's a logical break point, is also a great way to save on token cost: If you can send only 10k of tokens in a system prompt instead of 50k and just choose which of 5 10k prompts to use for different cases then you save 80% of upstream token $$.

Consider running the model deterministic: 0 temp, same seed. It makes any errors you encounter easier to trace and debug.

Something to consider with respect to cost though: Many tasks that a SoTA can do with very little or no scaffolding can be done with these cheaper models and may not take much more scaffolding. If a SoTA giving reliable responses with zero shot prompting there's a decent chance you can save a ton of money with a flash model if you provide it one or few shot prompts. Open weight models even more so.

My anecdotal experience is that open models like Google's gemma and OpenAI's gpt-oss have behaviors more similar to their paid counterparts than other open models, making them reasonable candidates to try if you're getting good results from the paid models but they're perhaps overkill for the task.