Comment by water-data-dude
3 days ago
It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.
Ways data might leak to the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
There are also severe selection effects: which documents got preserved, printed, and scanned because they turned out to be on the right track towards relativity?
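To make the metadata point concrete, here's a rough sketch of the kind of date filter you'd end up writing. The field names and records are made up, and it's exactly the undated/mislabeled documents that bite you:

```python
# Hypothetical sketch: keep only documents we can positively date before a
# cutoff, based on whatever date metadata each record carries. Field names
# ("text", "date") are illustrative; real archives vary wildly.
from datetime import date

CUTOFF = date(1905, 1, 1)  # pre-special-relativity, for the sake of argument

def keep(doc: dict) -> bool:
    """Keep a document only if its metadata dates it before the cutoff."""
    d = doc.get("date")
    if d is None:
        return False  # undated material is exactly where leaks hide
    return d < CUTOFF

corpus = [
    {"text": "On the electrodynamics of moving bodies", "date": date(1905, 6, 30)},
    {"text": "A letter to Lorentz", "date": date(1899, 3, 2)},
    {"text": "An OCR'd pamphlet, no date recorded", "date": None},
]
clean = [doc for doc in corpus if keep(doc)]
print(len(clean))  # 1 -- but a later footnote or annotation inside the text still leaks
```

Note the filter only trusts the metadata; if a librarian misfiled a 1920 reprint as an 1890 original, this check passes it straight through.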
This.
Especially for London, there's a huge corpus of recorded parliamentary debates.
For dialogue, though, training on recorded correspondence in the form of letters seems more interesting anyway.
And that corpus script looks odd, to say the least: does it just oversample by X?
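If "oversample by X" just means duplicating under-represented sources, it'd look something like this (the factors and source tags are pure guesses, not taken from the actual script):

```python
# Guess at what "oversample by X" might mean: naive duplication of
# under-represented sources before shuffling. Weights are made up.
import random

OVERSAMPLE_FACTOR = {"letters": 3, "debates": 1}  # illustrative weights only

docs = [
    {"source": "letters", "text": "Dear Professor..."},
    {"source": "debates", "text": "The honourable member..."},
]

# Each document appears OVERSAMPLE_FACTOR[source] times per pass.
resampled = [d for d in docs for _ in range(OVERSAMPLE_FACTOR[d["source"]])]
random.shuffle(resampled)
print(len(resampled))  # 4: the letter shows up three times
```

Plain duplication like this changes the mix but adds zero new information, which is presumably why the script reads oddly.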
Oh! I honestly didn't think about that, but that's a very good point!
Just Ctrl+F the data. /s
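Joking aside, a naive substring scan for obviously anachronistic terms is a cheap first pass, even though it can only ever find leaks, never prove their absence. The term list here is purely illustrative:

```python
# Sketch of a "Ctrl+F" leak check: flag documents containing terms that
# shouldn't exist in a pre-1905 corpus. The list is illustrative, not complete.
ANACHRONISMS = ["spacetime", "special relativity", "Minkowski"]

def suspicious(text: str) -> list[str]:
    """Return the anachronistic terms found in a document, if any."""
    low = text.lower()
    return [t for t in ANACHRONISMS if t.lower() in low]

print(suspicious("A note on the Minkowski metric"))        # ['Minkowski']
print(suspicious("A treatise on the luminiferous aether")) # []
```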