
Comment by manquer

21 hours ago

Guardrails are a rough analogue to binding parameters in SQL perhaps.
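To spell out the analogy: binding parameters means the value travels separately from the SQL text, so it can never be reinterpreted as SQL. A throwaway sqlite3 example (nothing guardrail-specific, just what "binding" looks like):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")

    user_input = "Robert'); DROP TABLE users;--"

    # Unsafe: concatenating the input mixes data into the SQL text (classic injection).
    #   conn.execute("INSERT INTO users (name) VALUES ('" + user_input + "')")

    # Safe: the driver binds the value as data, outside the SQL text.
    conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
    print(conn.execute("SELECT name FROM users").fetchone())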

These methods do work better than prompting alone. Prompting by itself, for example, has much poorer reliability at producing JSON output that consistently adheres to a schema: OpenAI cited roughly 40% reliability for prompting versus 100% with their structured outputs approach [1].
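For reference, here is roughly how that structured outputs feature is requested (a minimal sketch based on [1]; the model name and parameter shapes are from the announcement and may have changed since):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    schema = {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
        "additionalProperties": False,
    }

    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # model cited in the announcement
        messages=[{"role": "user", "content": "Extract the person: Ada Lovelace, age 36."}],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "person", "schema": schema, "strict": True},
        },
    )
    print(resp.choices[0].message.content)  # JSON guaranteed to conform to the schema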

Content moderation is of course more challenging and more nebulous. Justice Potter Stewart famously described the test for hard-core pornography as "I know it when I see it" in Jacobellis v. Ohio, 378 U.S. 184 (1964) [2].

It is more difficult still for a model marketed as lightly moderated, like Grok.

However, that doesn't mean the other methods don't work or aren't being used at all.

[1] https://openai.com/index/introducing-structured-outputs-in-t...

[2] https://en.wikipedia.org/wiki/Jacobellis_v._Ohio

The structured data JSON output thing is a special case: it works by interacting directly with the "select next token" mechanism, restricting the LLM to picking only from tokens that would be valid given the specified schema.

This makes invalid output (as far as the JSON schema goes) impossible, with one exception: if the model runs out of output tokens the output could be an incomplete JSON object.
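To make that concrete, here is a toy sketch of the idea: a stand-in "model" scores every token, and a hand-written grammar for one tiny schema masks the choices down to tokens that keep the output valid. (The vocabulary, grammar, and scoring are all invented for illustration; real implementations do this over the tokenizer's sub-word vocabulary.)

    import random

    # Toy "tokens"; real tokenizers use sub-word pieces, but the idea is the same.
    VOCAB = ['{"answer": ', '"yes"', '"no"', '"maybe"', '}', 'hello', '42']

    # Hand-written grammar for the schema {"answer": "yes" | "no"}:
    # given the output so far, which tokens are legal next?
    def allowed_tokens(output_so_far):
        if output_so_far == "":
            return {'{"answer": '}
        if output_so_far == '{"answer": ':
            return {'"yes"', '"no"'}          # "maybe" is outside the schema
        if output_so_far in ('{"answer": "yes"', '{"answer": "no"'):
            return {'}'}
        return set()                          # object complete, nothing more allowed

    # Stand-in for the LLM: in a real system these would be next-token logits.
    def fake_model_scores(output_so_far):
        return {tok: random.random() for tok in VOCAB}

    def constrained_decode(max_tokens=10):
        out = ""
        for _ in range(max_tokens):
            legal = allowed_tokens(out)
            if not legal:                     # grammar says we're done
                break
            scores = fake_model_scores(out)
            # The key step: mask out forbidden tokens, then pick the best survivor.
            out += max(legal, key=lambda tok: scores[tok])
        return out

    print(constrained_decode())  # always a valid object, e.g. {"answer": "no"}

If max_tokens were too small the loop would stop mid-object, which is exactly the incomplete-JSON exception above.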

Most of the other things that people call "guardrails" offer far weaker protection - they tend to use additional models which can often be tricked in other ways.
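For instance, one common pattern is to run the input (or output) past a separate moderation model first and refuse if it flags. A rough sketch of that shape using OpenAI's moderation endpoint (the classifier-in-front pattern in general, not any particular vendor's guardrail product):

    from openai import OpenAI

    client = OpenAI()

    def guarded_completion(user_text: str) -> str:
        # Guardrail pass: a separate moderation model screens the input.
        mod = client.moderations.create(model="omni-moderation-latest", input=user_text)
        if mod.results[0].flagged:
            return "Refused by the moderation guardrail."
        # Only then does the main model see the text.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": user_text}],
        )
        return resp.choices[0].message.content

    print(guarded_completion("Tell me a joke about databases."))

Unlike the token-level masking above, this check is itself just another model, so it can be tricked with the usual adversarial phrasing.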

  • You are right of course.

    I didn't mean to imply that all methods give 100% reliability the way structured data output does. My point was just that there are non-system-prompt approaches that give on-par or better reliability and/or injection resistance; it is not just "system prompt or bust", as other posters suggest.