← Back to context

Comment by Night_Thastus

5 days ago

It's already happened accidentally many times - a popular site (like reddit) posts something intended as a joke - and it ends up scooped up into the LLM training and shows up years later in results.

It's very annoying. It's part of the problem with LLMs in general, there's no quality control. Their input is the internet, and the internet is full of garbage. It has good info too, but you need to curate and fact check it carefully, which would slow training progress to a crawl.

Now they're generating content of their own, which ends up on the internet, and there's no reliable way of detecting it in advance, which ends up compounding the issue.

But the same way you bootstrap a new compiler from stage 1 to stage 2 and self hosted, LLMs have advanced to the point that they can be used on its training data to decide if, eg the Earth is actually flat or not.

  • Most facts about the world can't be deduced from logic. They're just facts, to memorize. The King's lefthanded. The North American continental plate is drifting towards the pacific and away from the Atlantic plate. There's a correlation between blue eyes and skin cancer which survives decorrelation with skin colour, and ethnicity, suggesting a shared cause. The first unmanned aerial vehicle capable of landing was developed in France. A general named Rogers led the British in the war of 1812.

    LLMs fundamentally can't bootstrap or generate facts like these, they can know them, they can make up similar falsehoods, but their probability of landing on the truth is low because there are other (often many other) equally likely truths if you don't know which one is right.

    (Please note: I made up all the "facts" in this post)

    • Are you saying human brain is kind of similarly vulnerable to well-crafted facts? Does it mean any intelligence (human or non-human) needs a large amount of generally factual data to discern facts from fakes, which is an argument toward AIs that can accumulate huge swath of factual data?

      1 reply →

  • The difference that a compiler is (generally) deterministic. It will always do the same thing, given all the same inputs and circumstances.

    An LLM is not, it's probabilistic text. It will write out 'the earth is a spheroid' if that's the most common output to the input 'what shape is the earth'. But it does not understand what it is writing. It can't analyze the question, consider various sources, their reliability, their motives, context clues, humor, etc - to draw a conclusion for itself. It can't make a mistake and then learn from that mistake when corrected.

    • probabilistically, why does that matter? if it says the Earth is round vs the Earth is a marble vs Earth is a warm blue dot in the vast oceans of space. Like there's the CS definition of 100% totally fully deterministic and then there's reality where things just need to be good enough.

      2 replies →

    • There is no reason to believe an LLM answers a question with the most common answer on the internet.

      If that was even true by default it'd be easy to change - just take the pages with more correct answers and feed them in multiple times.

      2 replies →