Comment by usefulcat
1 day ago
I doubt that it's common for anyone to read a research paper and then question whether the researcher's calculator was working reliably.
Sure, maybe someday LLMs will be able to report facts in a mostly reliable fashion (like a typical calculator), but we're definitely not even close to that yet, so until we are, the skepticism is very much warranted. Especially when the details really do matter, as in scientific research.
> whether the researcher's calculator was working reliably.
LLMs do not work reliably; that's not their purpose.
If you use them that way, it's akin to using a butter knife as a screwdriver. You might get away with it once or twice, but then you slip and stab yourself. Better to go find a screwdriver if you need reliability.
> I doubt that it's common for anyone to read a research paper and then question whether the researcher's calculator was working reliably
Reproducibility and repeatability in the sciences?
Replication crisis > Causes > Problems with the publication system in science > Mathematical errors; Causes > Questionable research practices > In AI research, Remedies > [..., open science, reproducible workflows, disclosure, ] https://en.wikipedia.org/wiki/Replication_crisis#Mathematica...
Verifiable proofs already run to far too many pages for human review.
There are "verify each Premise" and "verify the logical form of the Argument" (P therefore Q) steps that the model still doesn't do for the user.
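To make the "verify the logical form" step concrete, here is a minimal sketch of a brute-force truth-table check for propositional argument forms. The names `is_valid` and `implies` are made up for illustration. It only checks form; whether each premise is actually true still has to be verified separately.

```python
from itertools import product

def is_valid(premises, conclusion, n_vars):
    """True if the conclusion holds in every truth assignment
    that satisfies all of the premises (hypothetical helper)."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample: premises true, conclusion false
    return True

def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b

# Modus ponens: P, P -> Q, therefore Q
modus_ponens = is_valid(
    [lambda p, q: p, lambda p, q: implies(p, q)],
    lambda p, q: q,
    n_vars=2,
)

# Affirming the consequent: Q, P -> Q, therefore P
affirming_consequent = is_valid(
    [lambda p, q: q, lambda p, q: implies(p, q)],
    lambda p, q: p,
    n_vars=2,
)

print(modus_ponens, affirming_consequent)
```

The first form passes and the second fails, because P false and Q true makes both premises true while the conclusion is false.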
For your domain, how insufficient is the output when you give it the process as a prompt, like:
Identify hallucinations from models prior to (date in the future)
Check each sentence of this: ```{...}```
Research ScholarlyArticles (and then their Datasets) which support and which reject your conclusions. Critically review findings and controls.
Suggest code to write to apply data science principles to proving correlative and causative relations given already-collected observations.
Design experiment(s) given the scientific method to statistically prove causative (and also correlative) relations
Identify a meta-analytic workflow (process, tools, schema, and maybe code) for proving what is suggested by this chat
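As a rough idea of what the "suggest code" prompt could produce for the correlative part, here is a minimal sketch of a permutation test on a Pearson correlation, using only the standard library. The function names, the number of permutations, and the seed are assumptions for illustration. It says nothing about causation, which needs a designed experiment or an explicit causal model.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value for the observed correlation under the null
    hypothesis that the pairing of xs and ys is arbitrary."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

Usage would be something like `permutation_p_value(doses, responses)` on already-collected paired observations, where `doses` and `responses` are placeholder names.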