Comment by Lerc

1 day ago

>what exactly is this specific challenge of adding numbers with a transformer model demonstrating/advancing?

Well for starters, it puts the lie to the argument that a transformer can only output examples it has seen before. Performing the calculation on examples that haven't been seen demonstrates generalisation of the principles and not regurgitation.

While this misconception persists in a large number of people, counterexamples can always serve a useful purpose.

Are people usually claiming that it strictly cannot produce any output it hasn't seen before? I wouldn't agree with that; clearly they are generating some form of new content. My argument would be that while they can learn to some extent, the power of their generalisation is still tragically weak, particularly in some domains.

>it puts the lie to the argument

But it does not, right? You can either show it something, or modify the parameters in a way that resembles the result of showing it something.

You can claim that the model didn't see the thing, but that would mean nothing, because you are producing the same effect indirectly through parameter tweaks.

  • That's a counterargument to a different thing.

    Iteratively measuring loss is a way to reconstruct values. That's trivial to show for a single value: if 5 gives you a loss of 2 and 9 gives you a loss of 2, then you know the missing value is 7.

    A model with enough parameters can memorise the training set in a similar manner. Technically the model hasn't seen that data by direct input either, but the mechanism provides the means to determine what the data was. In that respect it is reasonable to say the model has seen the data.

    Performing well on examples not in the training set is doing something else.

    Any attempt to characterise that as having been seen before negates any distinction between taking in data and reasoning about that data.
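
To make the single-value case above concrete, here is a minimal sketch. It assumes the loss is plain absolute error, loss = |guess - hidden|, which the comment doesn't actually specify; the function name `recover_hidden_value` is made up for illustration. Each (guess, loss) pair leaves two candidates, and intersecting them across probes pins the value down.

```python
def recover_hidden_value(probes):
    """Return the set of hidden values consistent with every probe.

    probes: iterable of (guess, loss) pairs, where loss is ASSUMED to be
    the absolute error |guess - hidden| (an assumption, not from the thread).
    """
    candidates = None
    for guess, loss in probes:
        # An absolute-error loss leaves exactly two possibilities per probe.
        possible = {guess - loss, guess + loss}
        # Keep only values consistent with all probes seen so far.
        candidates = possible if candidates is None else candidates & possible
    return candidates


# The example from the comment: 5 -> loss 2, 9 -> loss 2.
print(recover_hidden_value([(5, 2), (9, 2)]))  # {7}
```

One probe alone is ambiguous (5 with loss 2 could mean 3 or 7); the second probe removes the ambiguity. That is the sense in which loss measurements alone can leak the data, and it is a different thing from doing well on inputs that never contributed to any loss.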

    • Yea, because "seeing" is also tweaking the parameters. Which this example is doing manually.

      So I don't understand how anyone can make the claim that the model has not seen it, because the internal transformation is similar.
