
Comment by cheeseblubber

20 hours ago

OP here. We realized there are a ton of limitations with backtesting and paper trading but still wanted to do this experiment and share the results. By no means is this statistically significant evidence of whether these models can beat the market in the long term. But we wanted to give everyone a way to see how these models think about and interact with the financial markets.

What were the risk adjusted returns? Without knowing that, this is all kind of meaningless. Being high beta in a rising market doesn't equate to anything brilliant.
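For reference, the simplest risk-adjusted measure the commenter is asking about is something like an annualised Sharpe ratio over the daily return series. A minimal sketch (the return values below are made up purely for illustration):

```python
from statistics import mean, stdev

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    vol = stdev(excess)
    if vol == 0:
        return 0.0
    return mean(excess) / vol * periods_per_year ** 0.5

# Hypothetical daily returns, for illustration only
returns = [0.01, -0.005, 0.007, 0.002, -0.003, 0.004]
sr = sharpe_ratio(returns)
```

A high raw return with an even higher volatility can still be a poor Sharpe ratio, which is the commenter's point about high-beta portfolios in a rising market.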

You're not really giving them any money and it's not actually trading.

There's no market impact to any trading decision they make.

You should redo this with human controls. By a weird coincidence, I have sufficient free time.

I can almost guarantee you that these models will underperform the market in the long run, because they are simply not designed for this purpose. LLMs are designed to simulate a conversation, not predict forward returns of a time series. What's more, most of the widely disseminated knowledge out there on the topic is effectively worthless, because there is an entire cottage industry of fake trading gurus and grifters, and the LLMs have no ability to separate actual information from the BS.

If you really wanted to do this, you would have to train specialist models - not LLMs - for trading, which is what firms are doing, but those are strictly proprietary.

The only other option would be to train an LLM on actually correct information and then see if it can design the specialist model itself, but most of the information you would need for that purpose is effectively hidden and not found in public sources. It is also entirely possible that these trading firms have already been trying this: using their proprietary knowledge and data to attempt to train a model that can act as a quant researcher.

> Grok ended up performing the best while DeepSeek came close to second.

I think you mean "DeepSeek came in a close second".

  • OK, now it says:

    > Grok ended up performing the best while DeepSeek came close second.

    "came in a close second" is a fixed idiom; it only works word-for-word.

Cool experiment.

I have a PhD in capital markets research. It would be even more informative to report abnormal returns (market/factor-adjusted) so we can tell whether the LLMs generated true alpha rather than just loading on tech during a strong market.
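The market-adjusted calculation the commenter describes amounts to regressing excess portfolio returns on excess market returns: beta captures market loading, and the intercept is the alpha. A minimal single-factor (CAPM-style) sketch in pure Python, with made-up return series for illustration:

```python
from statistics import mean

def capm_alpha_beta(portfolio, market, risk_free=0.0):
    """OLS of excess portfolio returns on excess market returns.

    Returns (alpha, beta) per period: beta is the market loading,
    alpha is the intercept (the abnormal return)."""
    rp = [p - risk_free for p in portfolio]
    rm = [m - risk_free for m in market]
    mp, mm = mean(rp), mean(rm)
    cov = sum((x - mm) * (y - mp) for x, y in zip(rm, rp))
    var = sum((x - mm) ** 2 for x in rm)
    beta = cov / var
    alpha = mp - beta * mm
    return alpha, beta

# Hypothetical period returns, for illustration only
port = [0.02, -0.01, 0.015, 0.005]
mkt = [0.01, -0.005, 0.008, 0.002]
alpha, beta = capm_alpha_beta(port, mkt)
```

A portfolio that is simply levered beta (e.g. loading heavily on tech in a tech rally) shows a high beta and an alpha near zero, which is exactly the distinction being asked for.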

I think it would be interesting to see how this goes in a scenario where the market declines, or where tech companies underperform the rest of the market. In recent history tech has outperformed, and that might bias the choices the LLMs make - would they hold on to these positive biases if their picks were performing badly?

These are LLMs - next token guessers. They don't think at all and I suggest that you don't try to get rich quick with one!

LLMs are handy tools, but no more. Even a heavily quantised Qwen3-30B will do a passable job of translating some Latin to English. It can whip up small games in a single prompt and much more, and with care it can deliver seriously decent results - but so can my drill driver! That model only needs a £500 second-hand GPU, which is impressive to me. Likewise GPT-OSS etc.

Yes, you can dive in with the bigger models that need serious hardware, and they seem miraculous. A colleague recently had to "force" Claude to read some manuals until it realised it had made a mistake about something - and frankly I think "it" was only saying it had made a mistake. I must ask said colleague to grab the reasoning and analyse it.

> But wanted to give everyone a way to see how these models think…

Think? What exactly did “it” think about?

  • You can click into the chart and see the conversation, as well as the reasoning it gave for each trade

    • A model can't tell you why it made the decision.

      What it can do is inspect the decision it made and make up a reason a human might have given for making it.