In that it created a circuit inside the shoggoth where it translates between hex and letters, sure, but this is not a straight lookup, it's not like a table, any more than my knowing "FF" is 255 is. This is not stochastic pattern matching any more than my ability to look at raw hex and see structures, NTFS file records and the like (yes, I'm weird, I've spent 15 years in forensics), in the same way that you might know some French and have a good guess at a sentence if your English and Italian are fluent.
Or even conversations presented entirely in hex. Not only could that have occurred naturally in the wild (pre-2012 Internet shenanigans could get pretty goofy), it would be an elementary task to represent a portion of the training corpus in various encodings.
So the things I have seen in generative AI art lead me to believe there is more complexity than that.
Ask it to do a sci-fi scene inspired by Giger but in the style of Van Gogh. Pick 3 concepts and mash them together and see what it does. You get novel results. That is easy to understand because it is visual.
Language is harder to parse in that way. But I have asked for haiku about cybersecurity, workplace health and safety documents in Shakespearean sonnet style, etc. Some of the results are amazing.
I think actual real creativity in art, as opposed to incremental change or combinations of existing ideas, is rare. Very rare. Look at style development in the history of art over time. A lot of standing on the shoulders of others. And I think science and reasoning are the same. And that's what we see in the LLMs, for language use.
There is plenty more complexity, but that emerges more from embedding, where the less superficial elements of information (such as syntactic dependencies) allow the model to home in on the higher-order logic of language.
e.g. when preparing the corpus, embedding documents and subsequently duplicating some with a version where the tokens are swapped for their hex representations could allow an LLM to learn to "speak hex", as well as intersperse the hex with the other languages it "knows". We would see a bunch of encoded text, but the LLM would be generating based on the syntactic structure of the current context.
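The duplication idea above can be sketched in a few lines. This is a minimal, hypothetical preprocessing pass, not any actual pipeline: the names `to_hex` and `augment_corpus` are illustrative, and real corpus prep would sample documents rather than take every Nth one.

```python
# Hypothetical sketch of the augmentation described above: duplicate a
# subset of training documents with their text swapped for a hex
# representation, so the model sees parallel plain/hex examples.

def to_hex(text: str) -> str:
    """Encode a UTF-8 string as space-separated hex byte pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

def augment_corpus(docs, fraction=0.1):
    """Yield every document, plus hex-encoded duplicates of a subset.

    Crude sampling for illustration: hex-duplicate every Nth document,
    where N = 1/fraction.
    """
    step = int(1 / fraction)
    for i, doc in enumerate(docs):
        yield doc
        if i % step == 0:
            yield to_hex(doc)

corpus = ["The quick brown fox", "FF is 255"]
augmented = list(augment_corpus(corpus, fraction=0.5))
# "The quick brown fox" becomes "54 68 65 20 71 ..." alongside the original.
```

The point is only that plain text and its encoding sit side by side in the training data, so the mapping between them is learnable from context rather than stored as a lookup table.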