Comment by sethops1
20 hours ago
> Testing GPT-5, Claude, Gemini, Grok, and DeepSeek with $100K each over 8 months of backtested trading
So the results are meaningless - these LLMs have the advantage of foresight over historical data.
> We were cautious to only run after each model’s training cutoff dates for the LLM models. That way we could be sure models couldn’t have memorized market outcomes.
Even if it is after the cutoff date, wouldn't the models be able to query external sources to get data that could positively impact them? If the returns were smaller I could reasonably believe it, but beating the S&P 500's returns by 4x+ strains credulity.
We used the LLMs' APIs and provided custom tools, like a stock ticker tool that only gave stock price information up to that date of the backtest for the model. We did this for news APIs, technical indicator APIs, etc. It took quite a long time to make sure there wasn't any data leakage. The whole process took us about a month or two to build out.
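To make the setup concrete, here is a minimal sketch of what such a time-gated tool could look like. This is an assumption about their design, not their actual code; the class and method names (`PriceTool`, `set_sim_date`, `get_price`) are hypothetical:

```python
from datetime import date

# Hypothetical sketch of a time-gated price tool: the backtest harness
# controls the simulation clock, and the model's queries are refused
# whenever they reach past it.
class PriceTool:
    def __init__(self, history: dict[str, dict[date, float]]):
        # history maps ticker -> {date: closing price}
        self.history = history
        self.sim_date: date | None = None  # advanced by the backtest loop, never by the model

    def set_sim_date(self, d: date) -> None:
        self.sim_date = d

    def get_price(self, ticker: str, d: date) -> float:
        # Block any query beyond the simulation clock, so future
        # prices can never leak into the model's context.
        if self.sim_date is None or d > self.sim_date:
            raise ValueError("query beyond simulation date: data leakage blocked")
        return self.history[ticker][d]
```

The same wrapper pattern would apply to the news and technical-indicator tools: each one filters its underlying data source by the simulation date before anything reaches the model.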
I know very little about what the environment where they run these models looks like, but surely they have access to different tools, like vector embeddings with more current data on various topics?
If they could "see" the future and exploit that they'd probably have much higher returns.
you can (via the API, or to a lesser degree through the settings in the web client) determine what tools, if any, a model can use
> We time segmented the APIs to make sure that the simulation isn’t leaking the future into the model’s context.
I wish they could explain what this actually means.
Overall, it does sound weird. On the one hand, assuming I properly understand what they are saying, they removed the model's ability to cheat based on what it saw in training. And I do get that nuance ablation is a thing, but this is not what they are discussing there. They are only removing one avenue for the model to 'cheat'. For all we know, some of that data may have been part of its training set already...
It's a very silly way of saying that the data the LLMs had access to was presented in chronological order, so that for instance, when they were trading on stocks at the start of the 8 month window, the LLMs could not just query their APIs to see the data from the end of the 8 month window.
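A sketch of what that chronological presentation might look like as a backtest loop, assuming (as above) that each API wrapper exposes some gating hook; the function and method names here are my own illustration, not theirs:

```python
from datetime import date, timedelta

# Hypothetical sketch: the harness advances a simulation clock one day at
# a time, and every data API is synchronized to that clock before the
# model trades, so a model trading in month 1 of the window can never
# query month 8's data.
def run_backtest(start: date, end: date, apis: list, trade_fn):
    d = start
    decisions = []
    while d <= end:
        for api in apis:
            api.set_sim_date(d)        # assumed gating hook on each wrapper
        decisions.append(trade_fn(d))  # model only sees data up to `d`
        d += timedelta(days=1)
    return decisions
```

The key design point is that the clock lives in the harness, not in the model's tool calls, so the model has no parameter it could manipulate to peek ahead.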
That's only a concern if they were trained on data more recent than 8 months ago
Not sure how sound the analysis is but they did apparently actually think of that.