Comment by wk_end

12 hours ago

> 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.

Since it seems to just produce broken and nonsensical sentences (at least based on the one example given) I'm not sure if it does work at this scale.

Anyway, as written this passage doesn't really make a whole lot of sense (the point is that it produces broken sentences?), and given that it was almost certainly written by an AI, it demonstrates that the architecture doesn't work especially well at any scale (I kid, I kid).

How does it compare to a Markov chain generator I wonder.

  • The Transformer is the more powerful model than Markov chain, but on such a weak machine as the C64, a MC could output text faster - but it surely would sound "psychedelic", as the memory limits a MC to a first-order or second-order model, so to predict one word, only the two words before would be taken into account as context (and no attention).

    On a plain vanilla C64, the Transformer cannot really show what it's capable of doing. An implementation using 2 bit per weight (vectorized) could be slightly better, perhaps.