Comment by thegeomaster

1 year ago

I slightly tweaked your baseline em dash example and got 100% success rate with GPT-4.1 without any additional calls, token spend, or technobabble.

System prompt: "Remove every em-dash (—) from the following text while leaving other characters unchanged.\n\nReturn only the cleaned text."

User prompt: <prompt from tsce_chat.py filled with em dashes>

Temperature: 0.0

4 comments

thegeomaster

airylizard 1 year ago

Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

If you reran today you’d see the same improved pass rate I’m getting now. That’s the downside of benchmarking against latest model names; behaviour changes quietly unless you pin to a dated snapshot.

For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!

thegeomaster 1 year ago
> Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.
I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].
Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.
[1]: https://platform.openai.com/docs/models/gpt-4.1
- airylizard 1 year ago
  
  Right, the 4.1 training checkpoint hasn’t moved. What has moved is the glue on top: decoder heuristics / safety filters / logit-bias rules that OpenAI can hot-swap without re-training the model. Those “serving-layer” tweaks are what stomped the obvious em-dash miss for short, clean prompts. So the April-14 weights are unchanged, but the pipeline that samples from those weights is stricter about “don’t output X” than it was on day one. By all means, keep trying to poke holes! I’ve got nothing to sell; just sharing insights and happy to stress-test them.