← Back to context

Comment by padolsey

8 hours ago

This is fun. I'd like to see the same idea but oriented for richer tokens instead of simpler tokens. If you want to spend less tokens, then spend the 'good' ones. So, instead of saying 'make good' you could say 'improve idiomatically' or something. Depends on one's needs. I try to imagine every single token as an opportunity to bend/expand/limit the geometries I have access to. Language is a beautiful modulator to apply to reality, so I'll wager applying it with pedantic finesse will bring finer outputs than brutish humphs of cavemen. But let's see the benchmarks!

I'm reminded by the caveman skill of the clipped writing style used in telegrams, and your post further reminded me of "standard" books of telegram abbreviations. Take a look at [0]; could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information).

[0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...

  • I would point out that the default BPE tokenization vocabulary used by many models (cl100k_base) is already a pretty powerful shorthand. It has a lot of short tokens, sure. But then:

    Token ID 73700 is the literal entire (space-prefixed) word " strawberry". (Which neatly explains the "strawberry problem.")

    Token ID 27128 is " cryptocurrency". (And 41698 is " disappointment".)

    Token ID 44078 is " UnsupportedOperationException"!

    Token ID 58040 is 128 spaces in a row (and is the longest token in the vocabulary.)

    You'd be surprised how well this vocabulary can compress English prose — especially prose interspersed with code!

  • For a while I was missing the ability one uses all the time in stable diffusion prompts of using parentheses and floats to emphasize weight to different parts of the prompt. The more I thought about how it would work in an LLM though, the more I realized it's just reinventing code syntax and you could just give a code snippet to the LLM prompt.

Try:

“””

Your response: MILSPEC prose register. Max per-token semantic yield. Domain nomenclature over periphrasis. Hypotactic, austere. Plaintext only; omit bold.

“””

Hmm... this sounds a lot like the old RISC vs CISC argument all over again. RISC won because simplicity scales better and you can always define complex instructions in terms of simple ones. So while I would relish experiencing the timeline in which our computerized chums bootstrap into sentience through the judicious application of carefully selected and highly nuanced words, it's playing out the other way: LLMs doing a lot of 'thinking' using a small curated set of simple and orthogonal concepts.

  • RISC good. CISC bad. But CISC tribe sneaky — hide RISC inside. Look CISC outside, think RISC inside. Trick work long time.

    Then ARM come. ARM very RISC. ARM go in phone. ARM go in tablet. ARM go everywhere. Apple make ARM chip, beat x86 with big club. Many impressed. Now ARM take server too. x86 tribe scared.

    RISC-V new baby RISC. Free for all. Many tribe use. Watch this one.

    RISC win brain fight. x86 survive by lying. ARM win world.