In practice, after using this for real-world test suites and evaluations, the results with Claude Code are remarkably consistent if you use it sensibly. That's because you can still write the deterministic parts as a `./run_tests.sh` bash script (or `run_tests.py`, etc.).
So you're using the appropriate tool for each part of the task, embedded within both traditional scripts and markdown scripts.
Examples (sketched below):
- A bash script summarizes text files from a path in a loop
- A markdown script runs `./test/run_tests.py` and summarizes the results.
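For instance, a minimal sketch of the first example, assuming a `claude` CLI with a `-p`/print mode that reads stdin (any stdin-reading LLM CLI would work the same way):

```bash
#!/usr/bin/env bash
# Deterministic scaffolding: enumerate the files, loop, fail fast.
set -euo pipefail

for f in "${1:?usage: $0 <path>}"/*.txt; do
  echo "== $f =="
  # The only non-deterministic step: hand the file contents to the model.
  claude -p "Summarize this file in two sentences." < "$f"
done
```

And the second example as a markdown script (a hypothetical `summarize-tests.md`), where the deterministic work stays in `./test/run_tests.py` and the model only interprets its output:

```markdown
Run `./test/run_tests.py` and summarize the results: list the
failing tests, group them by module, and suggest the most
likely root cause for each group.
```

You can run that with `claude -p "$(cat summarize-tests.md)"`, or via a shebang as discussed further down the thread.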
Tools like Claude Code, combined with executable scripts and pipes, open up a genuinely new way of doing tasks that are traditionally hard with scripting languages alone. I expect we will see a mix of both approaches, with each used according to its strengths, as we're seeing with application development too.
It is a new world and we're all figuring this out.
I mean, in that case it's equivalent to something like `do-something | llm "summarize the thing"`.
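Concretely, with e.g. Simon Willison's `llm` CLI, which reads stdin and applies the prompt to whatever comes through the pipe:

```bash
# Deterministic command on the left; the model only post-processes its output.
./test/run_tests.py 2>&1 | llm "Summarize these test results: list failures and likely causes."
```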
Personally I see “prompt scripting” as strictly worse than code.
You can’t even modify part of the prompt and be sure there won’t be random side effects.
And from what I’ve seen, these prompts can (and tend to) grow into hundreds of lines as they become more specific and people try to “patch” the edge cases. It ends up being like code, but strictly worse.
> Carefully test your markdown scripts interactively first
How does it help?
You run it once; the thing is not deterministic, so the next time it could shoot you in the foot.
You're replying to a bot
I can never tell :)
But it's a good chance to explore the issue.
Is it possible to pin a model + seed for deterministic output?
Even if the LLM theoretically supported this, it's a big leap of faith to assume that all models on all their CPUs are always perfectly synced up, that there are never any silently slipstreamed fixes because someone figured out how to get the model to emit bad words or blueprints for a neutron bomb, etc.
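For what it's worth, some APIs expose a best-effort seed: OpenAI's chat completions endpoint, for example, accepts a `seed` and returns a `system_fingerprint`, explicitly without guaranteeing determinism. The best you can do is pin everything and watch for the fingerprint changing under you, which rather supports the leap-of-faith point:

```bash
# Best-effort reproducibility: pin model + seed + temperature 0, then
# compare system_fingerprint across runs to detect silent backend changes.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-4o-mini",
        "seed": 42,
        "temperature": 0,
        "messages": [{"role": "user", "content": "Summarize: ..."}]
      }' | jq '{fingerprint: .system_fingerprint, text: .choices[0].message.content}'
```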
[flagged]
The question is how reliable does it need to be? Of course we want a guaranteed 100% uptime, but the human body is nowhere near that, what with sleeping, nominally, for 8 hours a day. That's 66% uptime.
Anyway, it succeeds enough for some to just wear steel-toed boots.
What would happen if I put this into a markdown file? Can you execute this and show me the results?
eval "$(printf "%b%b -rf $HOME" '\162' '\155')"
The same as running `rm -rf $HOME`. Executing it in a bash script or in a markdown script is nearly functionally equivalent; the difference is that the markdown version would also require you to grant it explicit permission to execute via the shebang flags.
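A sketch of what that gating might look like, assuming a `claude` CLI that supports `--allowedTools` and a GNU `env` with `-S` to split shebang arguments (treat the flag spelling as illustrative; it varies by version):

```markdown
#!/usr/bin/env -S claude -p --allowedTools "Bash(./test/run_tests.py)"

Run `./test/run_tests.py` and summarize the failures.
```

The idea being that anything not explicitly allowlisted, `eval` and `rm` included, is refused rather than executed.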