
Comment by IIAOPSW

3 years ago

> A nonsense statement may be grammatically correct.

I'm sure, then, you already know that colorless green ideas sleep furiously.

I see no contradiction in what you are calling incorrect. Whatever representation our brain uses for concepts and thoughts, sharing that object requires us to pack it into a linear sequence of words which can then be reliably unpacked on the other side. The very nature of verbal communication forces the existence of serialization/deserialization rules. Those rules are what we call grammar. Grammar may be somewhat orthogonal to semantics (as you observe, it is possible to encode valid nonsense), but the grammar exists to encode semantics and is thus to some degree tied to it. The grammar rule of "subject verb object" doesn't only tell you how to check the validity of "colorless green ideas sleep furiously"; it tells you how to deserialize that sentence back into a hierarchy tree of constituents and their relations. It just so happens to unpack into an object of useless constituents and impossible relations.
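To make the serialization/deserialization analogy concrete, here's a minimal sketch in Python. The toy grammar, node labels, and function names are my own illustration, not anything standard: deserialization unpacks the flat word sequence into a constituent tree, serialization packs it back out.

```python
# A toy "grammar" for one sentence shape: (Adj Adj Noun) (Verb Adverb).
# Deserialization: flat word sequence -> constituent tree.
# Serialization: constituent tree -> flat word sequence.

def deserialize(words):
    """Unpack a five-word Adj-Adj-Noun-Verb-Adverb sentence into a tree."""
    adj1, adj2, noun, verb, adverb = words
    noun_phrase = ("NP", ("Adj", adj1), ("Adj", adj2), ("Noun", noun))
    verb_phrase = ("VP", ("Verb", verb), ("Adv", adverb))
    return ("S", noun_phrase, verb_phrase)

def serialize(tree):
    """Pack a constituent tree back into a linear word sequence."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]                      # leaf: a single word
    return [w for child in children for w in serialize(child)]

sentence = "colorless green ideas sleep furiously".split()
tree = deserialize(sentence)
assert serialize(tree) == sentence                # the round trip is lossless
print(tree)
# ('S', ('NP', ('Adj', 'colorless'), ('Adj', 'green'), ('Noun', 'ideas')),
#       ('VP', ('Verb', 'sleep'), ('Adv', 'furiously')))
```

The tree is grammatically fine; it's only once you try to interpret the nodes that you find constituents that can't mean anything together.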

Punctuation may be orthogonal to grammar in the general case, but in this particular language the two are highly coincident. Virtually all punctuation marks are grammatical particles. It doesn't have to be like this: some languages have "audible parenthesis" words; others have words that mark the end of a sentence as a question. Calling punctuation marks non-characters seems a bit artificial. Let's just call them the non-audible characters, by analogy with non-printable characters.
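The analogy is already baked into Unicode, which files punctuation under its own general categories the same way it files control characters. A quick sketch (the bucket labels are mine):

```python
import unicodedata

def bucket(ch):
    """Label a character as a letter, a 'non-audible' mark, or a non-printable."""
    cat = unicodedata.category(ch)     # two-letter Unicode general category
    if cat.startswith("L"):
        return "letter (audible)"
    if cat.startswith("P"):
        return "punctuation (non-audible)"
    if cat.startswith("C"):
        return "control (non-printable)"
    return "other"

for ch in "Wait, really?\t(Yes.)":
    print(repr(ch), "->", bucket(ch))
# ',' '?' '(' ')' '.' come back as punctuation; '\t' comes back as control.
```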

The argument about bits was apparently lost in transmission. I assure you this isn't a preference-and-opinion thing. Information theory applies just as well to natural language encodings as it does to computer protocols. The basic principles of information entropy and optimal transmission encoding show up in every language: the least frequently used words are the longest. In an analysis of conversations across languages, researchers found the bit rate to be roughly constant. Some spoken languages sound very fast, but that's because the information density per word is lower; the brain's bit rate is a constant. Whether or not we are using a computer, the size of an alphabet is measured in bits, and the bits in the alphabet determine how much you can possibly say per character. At one extreme, Chinese has over 5000 characters. That's around 13 bits of information per character, at the low, low cost of memorizing all of them. For comparison, ignoring capitalization and punctuation, English is a 5-bit alphabet, meaning the same amount of information fits into 3-letter words. The Hawaiian alphabet covers roughly 80% of those bits with just 3 letters, and the rest with a 4th. Think about how powerful that is. Is memorizing 5000 arbitrary squiggles worth it to compress the width of words down by ~3 chars?
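The arithmetic is easy to check. Here's a back-of-the-envelope version using the alphabet sizes above (exact counts, especially for Chinese, are debatable; 5000 is just the figure from this comment):

```python
import math

# Alphabet sizes as used in the comment: Hawaiian ~13 letters, English 26,
# Chinese taken as 5000 characters.
alphabets = {"Hawaiian": 13, "English": 26, "Chinese": 5000}

# Use one Chinese character's worth of information (~12-13 bits) as the
# per-word budget being compared.
target_bits = math.log2(5000)

for name, size in alphabets.items():
    bits_per_char = math.log2(size)
    chars_needed = math.ceil(target_bits / bits_per_char)
    print(f"{name}: {bits_per_char:.1f} bits/char, "
          f"{chars_needed} char(s) per ~13-bit word")

# Hawaiian: 3.7 bits/char, 4 char(s) per ~13-bit word
# English: 4.7 bits/char, 3 char(s) per ~13-bit word
# Chinese: 12.3 bits/char, 1 char(s) per ~13-bit word
```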

The number of bits in an alphabet also determines the minimum number of unique design elements needed to construct letters for it. 7-segment displays are a great example. As I said, our characters fit in 5 bits; that's the minimum. When our letters came about, nobody knew about bits and they certainly weren't designed with this in mind, yet almost every letter can be expressed on a 7-segment display. Seven on/off segments is seven bits of drawing against five bits of information, so writing a letter only wastes two bits per character relative to saying it.
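To put numbers on that, here's a small sketch; the segment patterns below follow the common gfedcba hex-display convention, but treat the specific encodings as illustrative rather than a standard letter font.

```python
import math

# Illustrative 7-segment patterns, bits 0-6 = segments a-g.
SEGMENTS = {
    "A": 0b1110111,  # a b c e f g
    "b": 0b1111100,  # c d e f g
    "C": 0b0111001,  # a d e f
    "d": 0b1011110,  # b c d e g
    "E": 0b1111001,  # a d e f g
    "F": 0b1110001,  # a e f g
}

display_bits = 7               # seven on/off segments = 7 bits of drawing
alphabet_bits = math.log2(26)  # ~4.7 bits of information per letter
print(f"waste per character: ~{display_bits - math.ceil(alphabet_bits)} bits")
# waste per character: ~2 bits
```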

When you learn a new ligature in an Arabic script, you've doubled the number of letters you know. When you learn a new Chinese character, you've learned a new Chinese character. Language is a transmission medium. It's a tool. My takes here are no more a matter of preference and opinion than the allocation of the radio spectrum. There's an optimization tradeoff between the limited character choices of Hawaiian and the extreme rote memorization of Chinese. Going from 13 to 26 characters does double the learning time, but the learning time at that stage was short anyway. Going from what we currently have to perhaps 60-ish characters (6 bits) doubles it again. Maybe that's tolerable. The next step up is ~128 characters (7 bits). There may be things you can do quicker with a large set of symbols, but the return on learning all of them doesn't pay off. Around 5 to 6 bits is where most writing systems settle.
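The tradeoff itself can be swept the same way. The cost/benefit model here is my own simplification: memorization cost taken as the number of symbols, benefit as the average number of characters needed to carry a ~13-bit word.

```python
import math

WORD_BITS = 13  # roughly one word's worth of information, per the figures above

for symbols in (13, 26, 64, 128, 5000):
    bits_per_char = math.log2(symbols)
    avg_word_len = WORD_BITS / bits_per_char
    print(f"{symbols:5d} symbols -> {bits_per_char:4.1f} bits/char, "
          f"~{avg_word_len:.1f} chars per 13-bit word")

#    13 symbols ->  3.7 bits/char, ~3.5 chars per 13-bit word
#    26 symbols ->  4.7 bits/char, ~2.8 chars per 13-bit word
#    64 symbols ->  6.0 bits/char, ~2.2 chars per 13-bit word
#   128 symbols ->  7.0 bits/char, ~1.9 chars per 13-bit word
#  5000 symbols -> 12.3 bits/char, ~1.1 chars per 13-bit word
```

Each doubling of the symbol inventory roughly doubles the memorization while shaving less and less off the word width, which is exactly the diminishing return that makes 5 to 6 bits the sweet spot.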

And that's why bloating the raw glyph table with letters and marks is the wrong solution.