Comment by airylizard

2 days ago

Why I came up with TSCE (Two-Step Contextual Enrichment).

+30 pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.

Free, open framework; check the repo and try it yourself:

https://github.com/AutomationOptimization/tsce_demo
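
Roughly, the flow is two chained calls: a first pass that builds a dense context "anchor" for the task, and a second pass that answers with that anchor prepended. A minimal sketch (prompt wording and function names here are illustrative, not the exact tsce_chat.py code):

    from openai import OpenAI

    client = OpenAI()

    def tsce_chat(task: str, model: str = "gpt-3.5-turbo") -> str:
        # Pass 1: enrichment -- have the model map the task before doing it.
        anchor = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Analyze the task below. Produce a dense sketch of "
                            "constraints, pitfalls, and success criteria."},
                {"role": "user", "content": task},
            ],
        ).choices[0].message.content

        # Pass 2: generation -- answer the same task with the anchor in context.
        final = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": f"Context sketch:\n{anchor}\n\n"
                            "Carry out the user's instructions exactly."},
                {"role": "user", "content": task},
            ],
        )
        return final.choices[0].message.content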

I tested this another 300 times with gpt-4.1, this time on removing those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs. TSCE with the same exact instructions and prompt: "Remove the em-dashes from my LinkedIn post. . .".

Out of the 300 tests, the baseline failed to remove the em-dashes 149 times; TSCE failed 18 times.

It works; all the data, as well as the entire script used for testing, is in the repo.
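
For context, the pass/fail check is just string containment; here's a sketch of the scoring loop (the repo script is the real harness, and `generate` stands in for one baseline or one TSCE call):

    def score(generate, trials: int = 300) -> int:
        """Return how many runs still contain an em-dash (i.e., failures)."""
        failures = 0
        for _ in range(trials):
            if "—" in generate():
                failures += 1
        return failures

    # e.g. compare score(lambda: baseline(PROMPT)) vs. score(lambda: tsce_chat(PROMPT))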

That's a lot of kilowatt-hours wasted on a find-and-replace operation.

Have you heard of text.replace("—", "-")?

  • The test isn't for how well an LLM can find or replace a string. It's for how well it can carry out the given instructions... Is that not obvious?

I slightly tweaked your baseline em-dash example and got a 100% success rate with GPT-4.1, without any additional calls, token spend, or technobabble.

System prompt: "Remove every em-dash (—) from the following text while leaving other characters unchanged.\n\nReturn only the cleaned text."

User prompt: <prompt from tsce_chat.py filled with em dashes>

Temperature: 0.0
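
In OpenAI SDK terms that's one call, something like:

    from openai import OpenAI

    client = OpenAI()

    TEXT = "<prompt from tsce_chat.py filled with em dashes>"

    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0.0,
        messages=[
            {"role": "system",
             "content": "Remove every em-dash (—) from the following text "
                        "while leaving other characters unchanged.\n\n"
                        "Return only the cleaned text."},
            {"role": "user", "content": TEXT},
        ],
    )
    print(resp.choices[0].message.content)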

  • Hey, thanks for kicking the tires! The run you’re describing was done in mid-April, right after GPT-4.1 went live. Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

    If you reran it today you'd see the same improved pass rate I'm getting now. That's the downside of benchmarking against the latest model names; behaviour changes quietly unless you pin to a dated snapshot.

    For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!
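
    If you want numbers that stay comparable over time, pin the dated snapshot rather than the alias, e.g.:

        # alias: resolves to whatever OpenAI currently serves; can drift
        model = "gpt-4.1"
        # dated snapshot: frozen behaviour, reproducible benchmarks
        model = "gpt-4.1-2025-04-14"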

    • > Since then OpenAI has refreshed the weights behind the “gpt-4.1” alias a couple of times, and one of those updates fixed the em-dash miss.

      I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].

      Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.

      [1]: https://platform.openai.com/docs/models/gpt-4.1
