Comment by dchu17

19 days ago

Yes, this is a major problem I thought about. The makeshift solution here was to redact the “identifying information” in the press release. Even then, in my benchmark GPT-5 could still match a release back to the right ticker around 53% of the time. It does not seem to be able to recall the stock's price in that benchmark, but to be honest I’m not entirely sure how trustworthy the benchmark is, and I may need to come up with a few more clever ways to validate it.

One solution could be to get experts to write similar press releases so that the text itself is out of distribution. Alternatively, if an actual quant firm has internal models, it can just make sure the pre-training data has a cutoff date that falls before the period being tested.
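To make the cutoff-date idea concrete, here is a minimal sketch in Python. The `PressRelease` type, its fields, and the `usable_for_backtest` helper are hypothetical names made up for illustration, and it assumes you actually know the model's training cutoff, which is only realistic for a model you trained yourself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PressRelease:
    # Hypothetical record type; field names are assumptions for this sketch.
    ticker: str
    published: date
    text: str


def usable_for_backtest(releases, training_cutoff):
    """Keep only releases published strictly after the training cutoff.

    Anything published on or before the cutoff may have been seen during
    pre-training, so it is excluded from evaluation.
    """
    return [r for r in releases if r.published > training_cutoff]


releases = [
    PressRelease("AAA", date(2023, 3, 1), "old release"),
    PressRelease("BBB", date(2024, 9, 1), "new release"),
]
clean = usable_for_backtest(releases, training_cutoff=date(2024, 1, 1))
```

Here only the `BBB` release survives the filter. This does nothing for a third-party model whose cutoff is unknown or fuzzy, which is the harder case.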

I'm curious, when you ran a quant fund, what was your approach?

mempko 18 days ago

We didn't use other LLMs. We built our own models and had a system designed to never leak future information at any given time point. Models could only access the data the system allowed at that time point, which gated off anything from the future. This means even training has to go through the same system.

You have to design it from the ground up with that approach. Just to give you an idea of how hard it is: when a company releases an earnings report, it can update it later with corrected information, so if you pull the report later you will leak future information into the past. So even basics like earnings need to be versioned by time.
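A rough sketch of what "versioned by time" could look like, in Python. This is not the system described above, just an illustration of the idea: every value is stored with the time it became known, and reads are always made "as of" a timestamp. The `PointInTimeStore` class, the key layout, and the numbers are all made up.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class PointInTimeStore:
    """Hypothetical store: each key maps to versions sorted by when they became known."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [(known_at, value), ...]

    def record(self, key, known_at, value):
        # Add a version and keep the list ordered by the time it became known.
        versions = self._versions[key]
        versions.append((known_at, value))
        versions.sort(key=lambda v: v[0])

    def as_of(self, key, when):
        # Return the latest version known at or before `when`, or None if
        # nothing had been published yet. Later corrections are invisible.
        versions = self._versions.get(key, [])
        times = [t for t, _ in versions]
        i = bisect_right(times, when)
        return versions[i - 1][1] if i else None


store = PointInTimeStore()
key = ("ACME", "2024Q1", "eps")  # made-up company, period, and field
store.record(key, datetime(2024, 5, 1), 1.10)   # original report
store.record(key, datetime(2024, 8, 15), 0.95)  # later correction

before_report = store.as_of(key, datetime(2024, 4, 1))      # None
before_correction = store.as_of(key, datetime(2024, 6, 1))  # 1.10
after_correction = store.as_of(key, datetime(2024, 9, 1))   # 0.95
```

A backtest or training run stepping through June 2024 would see 1.10, not the corrected 0.95, which is what a live system would have seen at the time. Pulling the "current" value from a vendor today would silently return 0.95 for every past date.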

But you know, most people don't really care and think they have an edge, and who knows, maybe they do. Only live trading will prove it.