Comment by Workaccount2
15 hours ago
LLMs are not archives of information.
People seem to have this belief, or perhaps just a general intuition, that LLMs are a Google search over a training set with a fancy language engine on the front end. That's not what they are. The models (almost) avoid copyright problems on their own, because they never copy anything in the first place, which is why the model is a dense web of weight connections rather than an orderly bookshelf of copied training data.
Picture yourself contorting your hands under a spotlight to cast a shadow in the shape of a bird. The bird is not in your fingers, even though the shadow of a real bird and the shadow of your hands look very similar. Furthermore, your hand shadow has no idea what a bird is.
While that's true in general, LLMs do know many things verbatim. For instance, GPT-4 can reproduce the Navy SEAL copypasta word for word, with all the misspellings.
For a task like this, I expect the tool to use web searches and sift through the results, similar to what a human would do. Based on the progress indicators shown during the process, this is what happens. It's not an offline synthesis purely from training data, which is what you would get from running a model locally. (At least if we can believe the progress indicators, but who knows.)