Comment by Dylan16807
5 days ago
So in your example you can apply just one test result at a time, in any order. And the more pieces of evidence you apply, the stronger your argument gets.
f = "The test(s) say the patient is a vampire, with a .01 false positive rate."
f∘f∘f = "The test(s) say the patient is a vampire, with a .000001 false positive rate."
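A quick sanity check on those numbers (a hypothetical `combined_fpr` helper; this assumes the tests are independent, so a false positive requires every test to be wrong at once):

```python
# Independent tests that each point at the same conclusion:
# a false positive requires EVERY test to misfire simultaneously,
# so the combined false positive rate is the product of the rates.
def combined_fpr(rates):
    result = 1.0
    for r in rates:
        result *= r
    return result

print(combined_fpr([0.01]))              # one test: 0.01
print(combined_fpr([0.01, 0.01, 0.01]))  # f∘f∘f: ~1e-06
```

Each extra independent test multiplies the false positive rate down, which is why the argument only gets stronger.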
In the chain example, f or g or h on its own is useless. Only f∘g∘h is relevant, and f∘g∘h is a lot weaker than f, g, or h appears on its own.
This is what a logic chain looks like, adapted for vampirism to make it easier to compare:
f: "The test says situation 1 is true, with a 10% false positive rate."
g: "If situation 1 then situation 2 is true, with a 10% false positive rate."
h: "If situation 2 then the patient is a vampire, with a 10% false positive rate."
f∘g∘h = "The test says the patient is a vampire, with a 27% false positive rate."
So there are two key differences. One is the "if"s that make the false positives build up. The other is that only h tells you anything about vampires. f and g are mere setup, so they can only weaken h. At best f and g would have 100% reliability and h would be its original strength, 10% false positive. The false positive rate of h will never be decreased by adding more chain links, only increased. If you want a smaller false positive rate you need a separate piece of evidence. Like how your example has three similar but separate pieces of evidence.
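The chain arithmetic can be sketched the same way (a hypothetical `chain_fpr` helper; this assumes each link's error is independent, so the conclusion survives only if every link holds):

```python
# A chain f∘g∘h: the conclusion is sound only if EVERY link holds,
# so the combined false positive rate is 1 minus the product of the
# per-link reliabilities. Adding links can only push it up.
def chain_fpr(link_rates):
    reliability = 1.0
    for r in link_rates:
        reliability *= (1.0 - r)
    return 1.0 - reliability

print(chain_fpr([0.10, 0.10, 0.10]))  # ~0.271, the 27% above
print(chain_fpr([0.00, 0.00, 0.10]))  # best case: f and g perfect -> 0.10
```

Even with perfect setup links, the chain never does better than the final link on its own.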
Again, my only argument was that you can have both situations occur. We could still construct an f∘g∘h to increase probability if we want. I'm not saying it cannot go down, I'm saying there's no absolute rule you can follow.
I don't think you can make a chain of logic f∘g∘h where the probability of the combined function is higher than the probability of f or g or h on their own.
Chain of logic meaning that only the last function updates the probability you care about, and the preceding functions give you intermediate information that is only useful to feed into the next function.
It is an absolute rule you can follow, as long as you're applying it the way it was intended, to a specific organization of functions. It's not any kind of combining, it's A->B->C->D combining. As opposed to multiple pieces that each independently imply D.
Just because you can use ∘ in both situations doesn't make them the same. Whether x∘y∘z is chaining depends on what x and y and z do. If all of them update the same probability, that's not chaining. If removing any of them would leave you with no information about your target probability, then it's chaining.
TL;DR: ∘ doesn't tell you if something is a chain, you're conflating chains with non-chains, the rule is useful when it comes to chains
I'm not disagreeing with you. You understand that, right?
The parent was talking about stringing together inferences. My argument was that *how you string them together matters*. That's all. I said "context matters."
I tried to reiterate this in my previous comment. So let's try one more time. Again, I'm not going to argue you're wrong. I'm going to argue that more context is needed to determine if likelihood increases or decreases. I need to stress this before moving on.
Let's go one more comment back, to where I asked if you're sure this doesn't apply to the Bayesian case too. My point there was that, again, context matters. Are these dependent or independent? My whole point is that we don't know which direction things will go without additional context. I __am not__ making the point that it always gets better like in the Bayesian example. The Bayesian case was _an example_. I also gave an example for the other case. So why focus on one of these and ignore the other?
∘ is the composition operator (at least in this context and you also interpreted it that way). So yes, yes it does. It is the act of chaining together functions. Hell, we even have "the chain rule" for this. Go look at the wiki if you don't believe me, or any calculus book. You can go into more math and you'll see the language change to use maps to specify the transition process.
Yes, yes it does. The *events* are independent but the *states* are dependent. Each test does not depend on the previous test, making the tests independent, but our marginal is dependent! Hell, you see this in basic Markov Chains too. The decision process does not depend on other nodes in the chain but the state does. If you want to draw our Bayesian example as a chain you can do so. It's going to be really fucking big, because you'd need to calculate all potential outcomes, making it both infinitely wide and infinitely deep, but you can. The inference process allows us to skip all those computations and lets us focus on only performing calculations for states we transition into.
Just ask yourself, how did you get to state B? *You drew arrows for a reason*. But arrows only tell us about a transition occurring, they do not tell us about that transition process. They lack context.
No, you're being too strict in your definition of "chain". Which brings us back to my first comment.
Look, we can still view both situations from the perspective of Markov Chains. We can speak about this with whatever language we want, but if you want chains let's use something that is clearly a chain. Our classic MC is the easy case, right? Our state only depends on the previous state, right? P(x_{t}|x_{t-1}). Great, just like the Bayesian case (our state is dependent but our transition function is independent). We can also have higher order MCs, depending on any n previous states. We can extend our transition function too: P(x_{t}|x_{t-1},...,x_0) = Q. We don't have to restrict ourselves to Q(x_{t-1}), we can do whatever the hell we want. In fact, our simple MC process is going to be equivalent to Q(x_{t-1},...,x_0); it's just that nothing ends up contributing except that x_{t-1}. The process is still the same, but the context matters.
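A minimal sketch of that point (a hypothetical two-state weather chain, made up for illustration; `Q` is handed the full history `x_{t-1},...,x_0` but only the last state contributes):

```python
import random

# Hypothetical first-order chain. The transition function Q receives the
# whole history Q(x_{t-1}, ..., x_0); the simple MC is just the special
# case where nothing but x_{t-1} ends up contributing.
P = {"sunny": {"sunny": 0.9, "rainy": 0.1},
     "rainy": {"sunny": 0.5, "rainy": 0.5}}

def Q(history, rng):
    probs = P[history[-1]]  # only x_{t-1} actually matters here
    return rng.choices(list(probs), weights=list(probs.values()))[0]

rng = random.Random(0)
states = ["sunny"]
for _ in range(5):
    states.append(Q(states, rng))
print(states)
```

The same `Q` signature would let a higher-order chain read further back into the history; the process is unchanged, only what contributes changes.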
This tells me you drew your chain wrong. If multiple things are each contributing to D independently then that is not A->B->C->D (or as you wrote the first time: `A->B, B->C, C->D`, which is equivalent!). You instead should have written something like A -> C <- B, or, using all 4 letters, A -> D, B -> D, C -> D.
These are completely different things! This is not a sequential process. This is not (strictly) composition.
And yet, again, we still do not know if these are decreasing. They will decrease if A,B,C,D ∈ ℙ and our transition functions are multiplicative (∏ x_i < x_j ∀ j, where x_i ∈ ℙ), but this will not happen if the transition function is additive (∑ x_i ≥ x_j ∀ j, where x_i ∈ ℙ).
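A toy illustration of the two regimes (made-up step values; ℙ here is just [0, 1]):

```python
# Made-up per-step values for illustration.
steps = [0.9, 0.8, 0.7]

# Multiplicative combination: the product of values in (0, 1) falls
# below every individual step, so chaining can only shrink it.
prod = 1.0
for p in steps:
    prod *= p
assert prod < min(steps)   # ~0.504

# Additive combination: the sum exceeds every individual step (and is
# no longer a probability without renormalizing), so it can only grow.
total = sum(steps)
assert total > max(steps)  # 2.4
```

Same inputs, opposite direction; which regime applies is exactly the "additional context" in question.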
We are still entirely dependent upon context.
Now, we're talking about LLMs, right? Your conversation (and CoT) is much closer to the Bayesian case than our causal DAG with dependence. Yes, the messages in the conversation transition us through states, but the generation is independent. The prompt and context lengthen, but this is not the same thing as the events being dependent. The LLM response is an independent event. Like the BI case, the state has changed, but the generation event is identical (i.e. independent). We don't care how we got to the current state! You don't need to have the conversation with the LLM. Every inference from the LLM is independent, even if the state isn't. The inference only depends on the tokens currently in context.

Assuming you turn on deterministic mode (setting seeds identically), you could generate an identical output by passing the conversation (properly formatted) into a brand new fresh prompt. That shows that the dependence is on state, not inference. Just like our Bayesian example, you'd generate the same output if you start from the same state. The independence is because we don't care how we got to that state, only that we are at that state (same with simple MCs).

There are added complexities that can change this, but we can't go there if we can't get to this place first. We'd need to have this clear before we can add complexities like memory and MoEs, because the answer only gets more nuanced.
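A toy sketch of the state-vs-inference distinction (using a hash as a stand-in for deterministic decoding; this is not a real LLM API, just a pure function of the context):

```python
import hashlib

# Toy stand-in for deterministic decoding: the next "token" is a pure
# function of the tokens currently in context -- no hidden dependence
# on how we arrived at that context.
def generate(context):
    return hashlib.sha256(" ".join(context).encode()).hexdigest()[:4]

# Build a conversation turn by turn...
convo = ["hello"]
convo.append(generate(convo))   # reply 1
convo.append(generate(convo))   # reply 2

# ...or replay the same prefix into a brand new "session":
fresh = ["hello", generate(["hello"])]
assert generate(fresh) == convo[2]  # same state -> same output
```

Because `generate` sees only the current context, replaying the state reproduces the inference exactly; the dependence lives in the state, not in the generation event.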
So again, our context really matters here, and the whole conversation is about how these subtleties matter. The question was whether those errors compound. I hope you see that that's not so simple to answer. *Personally*, I'm pretty confident they will in current LLMs, because they rely far too heavily on their prompting (it'll give you incorrect answers if you prime it that way, despite being able to give correct answers with better prompting), but this isn't a necessary condition now, is it?
TLDR: We can't determine if likelihood increases or decreases without additional context