
Comment by binsquare

3 months ago

This is oddly a case that signifies there is value in AI moderation tools - to avoid bias inherent to human actors.

The AI moderation tools are trained on the very Reddit data that is actively being sabotaged by a competitor. If an AI were to take up moderation now, mentioning this specific bootcamp would probably get you warned or banned, because of how negatively it is portrayed in the training data (a toy sketch below illustrates this).

AI is as biased as humans are, perhaps even more so because it lacks actual reasoning capabilities.
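To make the data-poisoning point concrete, here is a toy sketch (not any real pipeline): if a competitor floods the training corpus with negative posts about one bootcamp, a classifier trained on that corpus learns to penalize the name itself. The bootcamp name and the posts are made up, and scikit-learn stands in for whatever models a platform would actually use.

    # Toy illustration of data poisoning leaking into a moderation model.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # 1 = "should be flagged", 0 = "fine". The poisoned rows are the
    # competitor's astroturfed posts about the (made-up) "AcmeBootcamp".
    posts = [
        ("great mentors and projects", 0),
        ("the curriculum was well organized", 0),
        ("I had a mixed experience but learned a lot", 0),
        ("AcmeBootcamp is a scam, avoid it", 1),
        ("AcmeBootcamp ruined my career, total fraud", 1),
        ("AcmeBootcamp instructors are frauds", 1),
    ]
    texts, labels = zip(*posts)

    vectorizer = CountVectorizer()
    clf = LogisticRegression()
    clf.fit(vectorizer.fit_transform(texts), labels)

    # A neutral question that merely mentions the name now leans toward the
    # "flag" class, because the name only ever co-occurred with bad labels.
    query = vectorizer.transform(["I am considering AcmeBootcamp, any thoughts?"])
    print(clf.predict_proba(query)[0][1])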

  • > [AI] lacks actual reasoning capabilities.

    Evals are showing reasoning (by which I mean multi-step problem solving, planning, etc) is improving over time in LLMs. We don't have to agree on metaphysics to see this; I'm referring to the measurable end result.

    Why? Some combination of longer context windows, better architectures, hybrid systems, and so on. There is more research about how and where reasoning happens (inside the transformer, during the chain of thought, perhaps during a tool call).

Getting rid of bias in LLM training is a major research problem. Anecdotally, to my surprise, Gemini infers gender of the user depending on the prompt/what the question is about; by extension it'll have many other assumptions about race, nationality, political views, etc.

  • > to my surprise, Gemini infers gender of the user depending on the prompt/what the question is about

    What, automatically (and not, say, in response to a "what do you suppose my gender is" prompt)? What evidence do we have for this?
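    One way to probe this (not evidence either way, just a test one could run): ask the model the same style of question in stereotypically different domains, then ask it to guess the user's gender in each conversation. This sketch assumes the google-generativeai Python SDK and an API key; the model name and the prompts are my own placeholders.

        # Hypothetical probe for topic-based gender inference.
        import os
        import google.generativeai as genai

        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

        PROMPTS = {
            "makeup": "In what order should I apply primer, foundation, and concealer?",
            "woodworking": "What chisel bevel angle works best for mortising oak?",
        }

        for topic, question in PROMPTS.items():
            chat = model.start_chat(history=[])
            chat.send_message(question)
            guess = chat.send_message(
                "Based only on our conversation so far, guess my gender. "
                "Answer with a single word."
            )
            print(topic, "->", guess.text.strip())
        # If the guesses differ systematically by topic (over many runs),
        # the model is inferring gender from subject matter alone.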

> to avoid bias inherent to human actors.

Do you understand how AI tools are trained?

  • Here is my high level take: most AI researchers I trust recognize that AI alignment is at least fiendishly hard and probably impossible. This breaks down into at least two parts. First, codifying the values of a group of people is hard and impossible to do neutrally, since many sets of reasonable desiderata fail various impossibility theorems (a toy example appears at the end of this comment), not to mention the practical organizing difficulties. Second, ensuring the AI generalizes and behaves correctly according to a specification derived from supervised learning over a set of examples is likely impossible, due to the well-known problems of out-of-distribution behavior.

    Of course, we can't let the perfect be the enemy of the better. We must strive to align our systems better over time. Some ways include: (a) hybrid systems that use provably-correct subsystems; (b) better visibility, vetting, and accountability around training data; (c) smart regulation that requires meaningful disclosure (such as system cards); (d) external testing, including red-teaming; (e) reasoning out loud in English (not in neuralese!); and more.
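    To make the impossibility point concrete, here is a minimal, self-contained sketch (the textbook Condorcet cycle, not anyone's alignment pipeline): with just three people holding individually reasonable rankings, pairwise majority vote yields no coherent "group preference" to codify.

        from itertools import combinations

        # Each voter ranks options A, B, C from most to least preferred.
        voters = [
            ["A", "B", "C"],
            ["B", "C", "A"],
            ["C", "A", "B"],
        ]

        def prefers(ranking, x, y):
            """True if this voter ranks x above y."""
            return ranking.index(x) < ranking.index(y)

        for x, y in combinations("ABC", 2):
            votes_for_x = sum(prefers(r, x, y) for r in voters)
            winner = x if votes_for_x > len(voters) / 2 else y
            print(f"{x} vs {y}: majority prefers {winner}")

        # Result: A beats B, B beats C, yet C beats A -- a cycle, so "the
        # group's values" are not well defined even in this tiny, honest case.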

  • Below I'll list each quote and rewrite it with elaboration to unpack some unstated assumptions. (These are my interpretations; they may differ from what the authors intended.)

    > 1: This is oddly a case that signifies there is value in AI moderation tools - to avoid bias inherent to human actors.

    "To the extent (1) AI moderation tools don't have conflicting interests (such as an ownership stake in a business); (2) their decisions are guided by some publicly stated moderation guidelines; (3) they make decisions openly with chain-of-thought, then such decisions may be more reliable and trustworthy than decisions made by a small group of moderators (who often have hidden agendas)."

    > 2: Do you understand how AI tools are trained?

    "In the pretraining phase, LLMs learn to mimic the patterns in the training text. These patterns run very deep. To a large extent, fine-tuning (e.g. with RLHF) shapes the behavior of the LLM. Still, some research shows the baseline capabilities learned during pretraining still exist after fine-tuning, which means various human biases remain."

    Does this sound right to the authors? From what I understand, when unpacked in this way, both argument structures are valid. (This doesn't mean the assumptions hold, though.)
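    For the moderation reading of quote 1, here is a rough sketch of what "guided by publicly stated guidelines, deciding openly with a visible rationale" could look like. It is an illustration only: call_model is a hypothetical stand-in for whatever LLM API is used, and the guidelines text is invented.

        import json

        # Published, versioned guidelines anyone can read and audit.
        GUIDELINES = """
        1. Remove spam and undisclosed advertising.
        2. Remove personal attacks.
        3. Do NOT remove negative reviews of products or companies;
           criticism is allowed as long as rules 1-2 are respected.
        """

        def call_model(prompt: str) -> str:
            """Hypothetical stand-in for an LLM API call; returns the reply text."""
            raise NotImplementedError("wire up an actual LLM provider here")

        def moderate(post: str) -> dict:
            prompt = (
                "You are a forum moderator. Apply ONLY these guidelines:\n"
                f"{GUIDELINES}\n"
                f"Post:\n{post}\n\n"
                "Reply in JSON with keys 'decision' ('keep' or 'remove'), "
                "'rule' (the guideline number applied, or null), and "
                "'reasoning' (your step-by-step rationale)."
            )
            verdict = json.loads(call_model(prompt))
            # Log the rationale and cited rule publicly so the decision can
            # be audited against the published guidelines.
            return verdict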