Comment by Lerc

4 days ago

>The issue is NOT lack of such database per se, the issue is lack of impartial oracle used to build such database.

You don't need a perfectly impartial oracle; sure, that would make the task much simpler, but it isn't required. You build examples from instances where the truth of the matter is now generally agreed upon.

The notable aspect here is that oracles predict the future, but all data is from the past. You can build examples from cases where, with the benefit of hindsight, you know the true outcome even if the majority consensus was the opposite at the time the data was created.

Perhaps Bitcoin is a good example, albeit on a different time scale. There are ample examples of arguments about Bitcoin predicting various outcomes, and that data is available now. At some point in the future it will be obvious which of those predicted outcomes is false. Any widely believed outcome that turned out to be false is a candidate for the data set. If there is any signal in that data revealing that the prediction was going to be false, the model has the potential to learn it. If there is no signal in that data, then on average signalless examples will balance out and have no overall impact.
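
To make the hindsight-labelling idea concrete, here is a minimal sketch in Python. It assumes a hypothetical archive of old claims; the `ArchivedClaim` fields and the helper function are illustrative, not a real dataset schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ArchivedClaim:
    text: str              # the original argument, e.g. an old Bitcoin prediction
    made_on: date          # when the claim was written
    consensus_then: bool   # whether it reflected majority opinion at the time
    outcome_known: bool    # whether hindsight has settled the question yet
    turned_out_true: bool  # the eventual outcome, once known

def build_hindsight_examples(claims: list[ArchivedClaim]) -> list[tuple[str, bool]]:
    """Keep only claims whose outcome is now settled; label each with the
    eventual truth rather than the consensus at the time it was written."""
    return [(c.text, c.turned_out_true) for c in claims if c.outcome_known]

# Widely believed claims that turned out false are the most valuable rows:
# if their text carries any signal of being wrong, a model can learn it;
# if not, they average out to noise rather than biasing the model.
```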

>The issue is lack of context informing us of the sincerity behind the claim, therefore proposal is to use LLMs to infer that. When presented with difficulties inferring that context from immediate message, your proposal is to build broader context and then feed it to LLM. To do what, spit it back packed in certain language tone? Do you see the tautology here?

There is no tautology here. The additional context is the provenance of the document combined with data that is clearly quantifiable by automated processes. Note that all of the examples of tagging types I gave are things that can be calculated analytically. The age of code can be determined by looking at git logs; when a model generates output itself, it can tag that output as its own; user input can be identified at the user interface, so all other data entering the model can be marked as not coming via the user interface.
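
None of that tagging needs an LLM. A rough sketch of the kind of analytic computation involved, assuming a git checkout is available; the `Source` categories and helper names are illustrative, not a fixed scheme:

```python
import subprocess
import time
from enum import Enum

class Source(Enum):
    USER_INPUT = "user_input"        # arrived via the user interface
    SELF_GENERATED = "model_output"  # produced by the model itself
    RETRIEVED = "retrieved"          # entered by any route other than the UI

def code_age_days(path: str, repo: str = ".") -> float | None:
    """Age of a file's last change, taken straight from the git log."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "-1", "--format=%at", "--", path],
        capture_output=True, text=True,
    ).stdout.strip()
    if not out:
        return None  # file not tracked by git
    return (time.time() - int(out)) / 86400

def tag_chunk(text: str, source: Source, path: str | None = None) -> dict:
    """Attach analytically determined provenance to a chunk of context."""
    return {
        "text": text,
        "source": source.value,
        "age_days": code_age_days(path) if path else None,
    }
```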

I proposed no LLM to do any of this, nor even for the translation of that data into the model. I suggested a model that may or may not involve a transformer. That input layer is only a translator, turning the additional context into terms that the LLM knows.

An LLM will have an internal way to represent the notion of code that was part of a successful build, or that a word is from a peer-reviewed scientific paper, or a reddit post, or that the source is unknown. Any information you can analytically determine about a source can be fed into the LLM along with its token, provided you know how the LLM represents that information in its embedding space.
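
As a sketch of what that could look like, assume the translator is simply a learned vector per provenance tag added to the token embedding; the tag ids, vocabulary size, and dimensions here are made up for illustration:

```python
import torch
import torch.nn as nn

class ProvenanceAugmentedEmbedding(nn.Module):
    """Token embedding plus a learned vector for each provenance tag.
    Tag ids (e.g. 0 = unknown source, 1 = peer-reviewed paper, 2 = reddit post,
    3 = code from a successful build) are illustrative, not a fixed scheme."""

    def __init__(self, vocab_size: int, num_tags: int, dim: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)   # the LLM's usual token embedding
        self.tag = nn.Embedding(num_tags, dim)     # translator for provenance tags

    def forward(self, token_ids: torch.Tensor, tag_ids: torch.Tensor) -> torch.Tensor:
        # Each token enters the model carrying both its identity and its source.
        return self.tok(token_ids) + self.tag(tag_ids)

emb = ProvenanceAugmentedEmbedding(vocab_size=50_000, num_tags=4, dim=512)
tokens = torch.randint(0, 50_000, (1, 8))
tags = torch.full((1, 8), 1)        # mark all eight tokens as peer-reviewed
hidden = emb(tokens, tags)          # shape (1, 8, 512), ready for the LLM
```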

Turning data we know something about into the structure an existing model would use to represent the same thing is precisely the kind of task that machine learning can currently solve.
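
A minimal sketch of what training that translator might look like, assuming the target representations are extracted from a frozen existing model; the placeholder tensors stand in for real analytic metadata features and real hidden states:

```python
import torch
import torch.nn as nn

# Map an analytically computed metadata vector (source type, age, build
# status, ...) onto the representation a frozen LLM already uses for the
# same information. The targets below are stand-ins; in practice they
# might come from the frozen model's hidden states when the provenance
# is stated to it in plain text.

translator = nn.Sequential(
    nn.Linear(8, 256), nn.GELU(), nn.Linear(256, 512),
)
opt = torch.optim.AdamW(translator.parameters(), lr=1e-3)

metadata = torch.randn(64, 8)    # placeholder analytic features
targets = torch.randn(64, 512)   # placeholder target representations

for step in range(100):
    pred = translator(metadata)
    loss = nn.functional.mse_loss(pred, targets)  # ordinary supervised regression
    opt.zero_grad()
    loss.backward()
    opt.step()
```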