Comment by Lerc

5 days ago

That's why I stated this case as an opportunity. To be able to do this you would need to have a set of examples where sometimes the dominant narrative is incorrect. This article represents one of those cases. Identifying more would be hard work and objectivity would be difficult, but I think possible.

Addressing your final point, I think there is scope for doing that. Having a provenance aspect to embeddings would do it. I suspect existing LLMs already infer this information quite well, but I think it might be possible to go a little further at inference time: instead of a straight text-to-token embedding, use a richer input-processing model that takes the text plus other known context data and produces an embedding holding that extra data. That input-processing model would have to be trained to convert the text plus context into a vector carrying the information in a form the model already understands.
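A minimal sketch of the kind of input-processing model I mean, in PyTorch; the class name, the single "source" feature, and the concatenate-then-project choice are illustrative assumptions, not a claim about how any existing model does it:

```python
import torch
import torch.nn as nn

class ProvenanceAwareEmbedding(nn.Module):
    """Hypothetical input layer: combines ordinary token embeddings with an
    embedding of known context/provenance data, then projects the result
    back into the dimension the downstream LLM expects."""

    def __init__(self, vocab_size: int, d_model: int, n_sources: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.source_emb = nn.Embedding(n_sources, d_model)  # e.g. user / model / repo / web
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, token_ids: torch.Tensor, source_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, source_ids: (batch, seq_len)
        t = self.token_emb(token_ids)
        s = self.source_emb(source_ids)
        return self.proj(torch.cat([t, s], dim=-1))  # (batch, seq_len, d_model)
```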

I think this would be useful in a number of other areas as well: first, being able to distinctly tag model-generated output so the model doesn't confuse itself with its own text; second, tagging individual tokens of code to say how long that code has been in the project, and whether it comes from a version that lints cleanly, compiles, is used in production, etc. Not to mention tagging prompts from the user as prompts and stripping that same tag from everything that is not a prompt, so that prompt injection becomes much harder to do.
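As a rough illustration of the per-token tags described above (the field names are invented for this example), and of why the prompt tag helps against injection:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenTag:
    """Hypothetical provenance tags attached to a single input token."""
    source: str                    # "user_prompt", "model_output", "repo_code", "web", ...
    is_user_prompt: bool           # only True for text entered at the user interface
    code_age_days: Optional[int]   # how long this code has been in the project
    lints_clean: Optional[bool]    # passes the project's linter
    compiles: Optional[bool]       # part of a successful build
    in_production: Optional[bool]  # shipped in a production version

def sanitize(tags: list[TokenTag]) -> list[TokenTag]:
    """Prompt-injection defence in this scheme: strip the prompt tag from
    everything that did not arrive through the user interface."""
    for t in tags:
        if t.source != "user_prompt":
            t.is_user_prompt = False
    return tags
```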

> To be able to do this you would need to have a set of examples where sometimes the dominant narrative is incorrect.

You are sidestepping the issue here. The issue is NOT the lack of such a database per se; the issue is the lack of an impartial oracle with which to build such a database.

> instead of a straight text to token embedding to have a richer input processing model that takes text plus other known context data to produce an embedding holding that extra data

The issue is the lack of context informing us of the sincerity behind a claim, so the proposal is to use LLMs to infer it. When presented with the difficulty of inferring that context from the immediate message, your proposal is to build broader context and then feed it to an LLM. To do what, spit it back packed in a certain tone of language? Do you see the tautology here?

Discussion around AIs/LLMs is quickly coming to resemble discussion around Bitcoin at certain points in the hype cycle. You can't solve for external effects internally: an LLM cannot tell truth from opinion, and the Bitcoin network cannot guarantee a transaction actually took place.

  • > The issue is NOT the lack of such a database per se; the issue is the lack of an impartial oracle with which to build such a database.

    You don't need a perfectly impartial oracle; sure, that would make the task much simpler, but you can build examples from instances where this is generally agreed to be the case.

    The notable aspect here is that oracles predict the future, but all data is from the past. You can build examples from cases where, with the benefit of hindsight, you know the truth of the outcome even if the majority consensus was the opposite at the time the data was created.

    Perhaps Bitcoin is a good example, albeit on a different time scale. There are ample examples of arguments about Bitcoin predicting various outcomes, and that data is available now. At some point in the future it will be obvious which of those predicted outcomes is false. Any widely believed outcome that turned out to be false is a candidate for the data set. If there is any signal in that data revealing that it was going to be false, the model has the potential to learn it; if there is no signal, then on average the signalless examples will balance out and have no overall impact.
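    One way to picture the resulting training examples, with hindsight supplying the label (field names invented for illustration):

    ```python
    from dataclasses import dataclass

    @dataclass
    class HindsightExample:
        """A past claim whose outcome is now known, usable as a training example."""
        text: str              # the argument or prediction as originally written
        written_at: str        # when it was written (ISO date)
        was_consensus: bool    # was this the dominant narrative at the time?
        turned_out_true: bool  # label assigned later, once the outcome is known
    ```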

    > The issue is the lack of context informing us of the sincerity behind a claim, so the proposal is to use LLMs to infer it. When presented with the difficulty of inferring that context from the immediate message, your proposal is to build broader context and then feed it to an LLM. To do what, spit it back packed in a certain tone of language? Do you see the tautology here?

    There is no tautology here. The additional context is the provenance of the document, combined with data that is clearly quantifiable by automated processes. Note that all of the tagging types I gave as examples are things that can be calculated analytically: the age of code can be determined by looking at git logs; when a model generates output, it can tag that output as its own; and user input can be identified at the user interface, so all other data entering the model can be marked as not having come via the user interface.
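    For instance, the code-age tag needs no model at all; it can be read straight out of version control. A rough sketch, assuming a local git checkout (the function name is just for illustration):

    ```python
    import subprocess
    import time

    def code_age_days(path: str, repo: str = ".") -> int:
        """Days since `path` was first added to the repository, per `git log`."""
        out = subprocess.run(
            ["git", "-C", repo, "log", "--diff-filter=A", "--format=%ct", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        oldest_add = int(out[-1])  # last line = oldest commit that added the file
        return int((time.time() - oldest_add) // 86400)
    ```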

    I proposed no LLM to do any of this, nor even for translating that data into the model. I suggested a model; it may involve a transformer, it may not. That input layer is only a translator, turning the additional context into terms the LLM already knows.

    An LLM will have an internal way to represent the notion of code that was part of a successful build, or that a word is from a peer-reviewed scientific paper, or a reddit post, or that it does not know the source. Any information you can analytically determine about a source can be fed into the LLM along with its token, if you know how the LLM represents that information in its embedding space.

    Turning data we know something about into the structure an existing model would use to represent the same thing is precisely the kind of task machine learning can currently solve.
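    A minimal sketch of that last step, assuming the adapter module from earlier and a frozen downstream LLM; the choice of target (the frozen model's own representations for text where the provenance is stated explicitly) is an illustrative assumption, not a claim about the right training signal:

    ```python
    import torch
    import torch.nn as nn

    def train_step(adapter: nn.Module,
                   token_ids: torch.Tensor,    # (batch, seq)
                   source_ids: torch.Tensor,   # (batch, seq) analytically derived provenance labels
                   target_vecs: torch.Tensor,  # (batch, seq, d_model) representations the frozen
                                               #   model produces when the provenance is spelled
                                               #   out explicitly in its input text
                   opt: torch.optim.Optimizer) -> float:
        # Only the adapter is trained; the downstream LLM stays frozen.
        pred = adapter(token_ids, source_ids)
        loss = nn.functional.mse_loss(pred, target_vecs)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()
    ```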