Comment by tifa2up
17 hours ago
We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.
4.1 is such an amazing model in so many ways. It's still my no. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x that of GPT-5). Definitely the best non-reasoning model out there for real-world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.
For large contexts (up to 100K tokens in some cases). We found that GPT-5:

a) has worse instruction following; it doesn't follow the system prompt

b) produces very long answers, which resulted in a bad UX

c) has a 125K context window, so extreme cases resulted in an error
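(Aside: one way to avoid the hard context-window errors mentioned in c) is to count tokens and trim the retrieved chunks before making the request. A minimal sketch in Python, assuming tiktoken's o200k_base tokenizer; the 125K limit is the figure quoted above and the safety margin and function names are illustrative, not anyone's actual pipeline:)

```python
# Minimal sketch: guard against hard context-window errors by counting tokens
# before sending retrieved chunks to the model. The 125K limit is the figure
# quoted in the parent comment; the safety margin is an arbitrary assumption.
import tiktoken

MAX_INPUT_TOKENS = 125_000   # figure quoted above, not from official docs
RESERVED_TOKENS = 8_000      # room for system prompt + answer (assumption)

enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by recent OpenAI models

def fit_chunks(ranked_chunks: list[str]) -> list[str]:
    """Keep retrieved chunks (best first) until the token budget is spent."""
    budget = MAX_INPUT_TOKENS - RESERVED_TOKENS
    kept, used = [], 0
    for chunk in ranked_chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept
```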
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
ChatGPT, when using 5 or 5-Thinking, doesn't even follow my "custom instructions" on the web version. It's a serious downgrade compared to the prior generation of models.
Ah, 100K of input against a 125K limit: I believe that's what causes the problems. GPT-5's scores should go up if you process contexts that are ten times shorter.
How do you objectively tell whether a model "performs" better than another?
Not the original commenter, but I work in the space and we have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
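(For concreteness, that kind of gold-evidence evaluation often boils down to something like recall@k over annotated queries. A minimal sketch, assuming a hypothetical `retrieve` function and per-query gold document IDs; this is illustrative, not the commenter's actual setup:)

```python
# Minimal sketch of a "gold evidence" retrieval eval: for each annotated query,
# check how much of the gold evidence shows up in the retriever's top-k results.
# The dataset shape and the `retrieve` callable are illustrative assumptions.
from typing import Callable

def recall_at_k(
    queries: list[dict],                        # each: {"question": str, "gold_ids": set}
    retrieve: Callable[[str, int], list[str]],  # returns ranked document IDs
    k: int = 10,
) -> float:
    """Average fraction of gold evidence docs found in the top-k results."""
    scores = []
    for q in queries:
        top_k = set(retrieve(q["question"], k))
        gold = q["gold_ids"]
        scores.append(len(top_k & gold) / len(gold) if gold else 1.0)
    return sum(scores) / len(scores)
```

Comparing two models then reduces to running the same harness over both pipelines and comparing the numbers.)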
> but I work in the space
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence something quantitative), the underlying judgments are by their nature not quantitative. Q&As on written subjects are very much subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa. Even though there are techniques to reduce overfitting, it still isn't eliminated. So it's very much subjective.
So… you did look back, then didn't look forward anymore… sorry, couldn't resist.