Comment by kristjansson

17 days ago

> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD whether it's successful at that.

It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.

  • Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)

It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must therefore lose some information and be unable to support perfect recall on longer sequences.

  • > N tokens looking at N tokens is quadratic

    Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element (a quick numerical check is sketched at the end of this comment).

    Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

        # O(len(a) * len(b)) work: every (i, j) pair contributes one term
        total = 0
        for i in range(len(a)):
            for j in range(len(b)):
                total += a[i] * b[j]
    

    This can be computed in linear time as sum(a) * sum(b).

    Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
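
    Both counterexamples are easy to check numerically. A minimal sketch, assuming numpy is available (the sizes, seed, and FFT padding are arbitrary choices, purely for illustration):

        import numpy as np

        rng = np.random.default_rng(0)
        a, b = rng.random(1000), rng.random(1000)

        # O(n log n) convolution via FFT vs. np.convolve, which follows the
        # direct all-pairs definition; the two agree to floating-point precision.
        out_len = len(a) + len(b) - 1
        fft_len = 1 << (out_len - 1).bit_length()   # next power of two
        spec = np.fft.rfft(a, fft_len) * np.fft.rfft(b, fft_len)
        conv_fft = np.fft.irfft(spec, fft_len)[:out_len]
        assert np.allclose(conv_fft, np.convolve(a, b))

        # And the nested loop above collapses to a product of sums:
        # O(n + m) instead of O(n * m), with nothing lost.
        assert np.isclose(sum(x * y for x in a for y in b), a.sum() * b.sum())
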

    • One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral

      \iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy    (where f(x, y, z) = g(x, y)*h(y, z))

      which greatly accelerated numerical integration (O(n^2) rather than O(n^3)); a toy numerical version is sketched at the end of this comment.

      My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.
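
      For illustration only (g, h, and the grid below are made up, not the dissertation's), the same trick can be checked numerically: the direct O(n^3) triple sum and the O(n^2) factored form agree.

          import numpy as np

          # Hypothetical separable integrand f(x, y, z) = g(x, y) * h(y, z)
          g = lambda x, y: np.exp(-(x - y) ** 2)
          h = lambda y, z: np.cos(y * z)

          n = 100
          x = y = z = np.linspace(0.0, 1.0, n)
          dx = dy = dz = x[1] - x[0]

          # Direct O(n^3) Riemann sum over the full 3-D grid
          X, Y, Z = np.meshgrid(x, y, z, indexing="ij")
          direct = np.sum(g(X, Y) * h(Y, Z)) * dx * dy * dz

          # Factored O(n^2): integrate out x and z for each y, then over y
          G = np.array([np.sum(g(x, yi)) for yi in y]) * dx   # \int g(x, y) dx
          H = np.array([np.sum(h(yi, z)) for yi in y]) * dz   # \int h(y, z) dz
          assert np.isclose(np.sum(G * H) * dy, direct)
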

    • That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e., there will be some problems the model is no longer able to solve. It can no longer approximate an arbitrary function that needs perfect memory over the sequence.

  • Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

    • It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.


  • I'm not saying whether the paper is correct (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

    Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must therefore lose some information.
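
    (That conclusion is, of course, false: Karatsuba has given exact sub-quadratic multiplication since the 1960s, roughly O(n^1.585) digit operations. A toy sketch of it, purely for illustration:)

        # Karatsuba: exact integer multiplication, sub-quadratic in the number
        # of digits, even though the product depends on every pair of digits.
        # (Illustrative sketch; assumes nonnegative integers.)
        def karatsuba(x: int, y: int) -> int:
            if x < 10 or y < 10:
                return x * y
            m = max(len(str(x)), len(str(y))) // 2
            hi_x, lo_x = divmod(x, 10 ** m)
            hi_y, lo_y = divmod(y, 10 ** m)
            z0 = karatsuba(lo_x, lo_y)                            # low halves
            z2 = karatsuba(hi_x, hi_y)                            # high halves
            z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2    # cross terms
            return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

        assert karatsuba(123456789, 987654321) == 123456789 * 987654321
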