Comment by smallerfish

2 days ago

A video demo would be useful. I can't really tell how much the application is doing from the screenshots. Is it a tool with some smart guidance, or is it doing deep magic?

I didn't think a video would be very exciting. It did feel like deep magic when I tested it, though. For the scenario in the screenshots, I provided the question "Did we really land a man on the moon?", the null hypothesis "We landed on the moon in 1969", and one low-value piece of evidence: "My dad told me he saw Stanley Kubrick's moon landing set one time and he never lies." Literally everything else the LLM generated on demand, offline, from its existing training data. It gave me hypotheses, challenges, evidence, filled out the matrix, did the calculations, everything.

  • > Literally everything else the LLM generated on demand for me based on its existing training data, offline

    That's a ton of scope for hallucinations, surely?

    • It would be enough to drive most local LLMs crazy if it tried to generate everything at once, or if it was all part of one long session, but it's set up so the LLM doesn't have to produce much at a time. Requests are batched in small groups (it will generate only 3 suggestions per request), the session is refreshed between calls, and the output is forced into the expected structure. You can, however, ask for new batches of suggestions or conflicts or evidence more than once. Hallucinations can happen with any LLM use, of course, but if they break the expected structure the output is generally thrown out. Even the matrix scoring suggestion works on a whole row, but behind the scenes the LLM is asked to return one response in one "chat" session per column, and the scores are all entered at the same time once every one of them has been returned individually. That way, if the LLM does hallucinate a score, that cell falls back to a neutral response and none of the neighboring cells are corrupted. (A rough sketch of that flow is at the end of this comment.)

      If you use a smaller model with a smaller context window, it might be more prone to hallucinations and provide less nuanced suggestions, but the default model seems able to handle the job pretty well without having to regenerate output very often (it does happen sometimes, but then you just run it again). Also, depending on the model, you might get less variety or creativity in the suggestions. It's definitely not perfect, and it definitely shouldn't be trusted to replace human judgement.
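      Roughly, the flow looks like the sketch below. This isn't the real code, just a Python sketch of the idea: `complete()` stands in for a single fresh, history-free session against whatever local model backend is in use, and the prompts, the 0 to 2 cell scale, and the neutral fallback value are placeholders rather than the app's actual values.

      ```python
      import json

      NEUTRAL_SCORE = 0  # placeholder fallback used when a cell's score can't be parsed

      def complete(prompt: str) -> str:
          """Stand-in for one fresh chat session against the local model (no prior history)."""
          raise NotImplementedError

      def generate_suggestions(topic: str, n: int = 3) -> list[str]:
          """Ask for a small batch of suggestions; discard the output if it doesn't
          fit the expected structure (a JSON array of exactly n strings)."""
          raw = complete(
              f"List exactly {n} competing hypotheses for: {topic}\n"
              f"Respond with a JSON array of {n} strings and nothing else."
          )
          try:
              items = json.loads(raw)
              if isinstance(items, list) and len(items) == n and all(isinstance(i, str) for i in items):
                  return items
          except json.JSONDecodeError:
              pass
          return []  # malformed output is thrown out; the caller just asks again

      def score_row(evidence: str, hypotheses: list[str]) -> list[int]:
          """Score one evidence row: one fresh session per column (hypothesis),
          then enter the whole row at once. A bad cell falls back to neutral
          instead of corrupting its neighbors."""
          row = []
          for hyp in hypotheses:
              raw = complete(
                  f"Evidence: {evidence}\nHypothesis: {hyp}\n"
                  "Reply with a single integer from 0 (consistent) to 2 (inconsistent)."
              )
              try:
                  score = int(raw.strip())
              except ValueError:
                  score = NEUTRAL_SCORE
              row.append(score if 0 <= score <= 2 else NEUTRAL_SCORE)
          return row  # only now does the caller write the whole row into the matrix
      ```

      Keeping each call small and stateless is the point: a single bad response costs at most one cell or one batch, never the whole matrix.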

  • And the answer was... ? :)

    • Well, based on the evidence provided against our competing hypotheses, the least problematic hypothesis was that we landed on the moon in 1969, with a score of 0 (lower is better). The second least problematic, scoring 6 out of a possible 10 for our evidence set, was "The Apollo 11 mission was a hoax staged by NASA and the U.S. government for public relations and Cold War propaganda, but the moon landing itself was real — only the public narrative was fabricated." The third least problematic, scoring 8, was "The Apollo 11 mission was a real event, but the moon landing was not achieved by humans — it was an automated robotic mission that was misinterpreted or falsely attributed to astronauts due to technical errors or media misreporting." There was also a tie for 4th place between "It was just a government coverup to protect the firmament. There is no "outer space."" and "The Apollo 11 mission never occurred; all evidence — including photos, video, and lunar rocks — was fabricated in secret laboratories using early 20th-century special effects and staged experiments, possibly by a small group of scientists and engineers working under government contract." Both of those scored 10 out of 10, making them the most problematic. Sorry guys.
