Comment by greggyb

3 years ago

> Parenthetical type grammar with an explicit start character and end character is pivotal for encoding information unambiguously.

You argue against multiple types of dashes because context is sufficient, despite there being typographical ambiguity. But you insist that we must have typographically unambiguous bracket characters. I must admit that I am struggling in this conversation to determine when we can depend on context and when we need unambiguous markers. Perhaps I am just incapable of picking up on the subtle context that backs up this position of yours. (:

> everything beyond that is strawmanning me

In fact, you will find examples of real human languages that exhibit more extreme versions of the things I have suggested.

FOREXAMPLELATINWASORIGINALLYWRITTENINASINGLECASEWITHNOSPACESBETWEENWORDS SENTENCESWERESEPARATEDBYASINGLESPACE OBVIOUSLYALLOFTHEPUNCTUATIONISUNNECESSARY SOALLARGUMENTSABOUTTYPOGRAPHYOTHERTHANTHATOFFONTSAREBASEDINREALITY

There are languages with simpler tense systems than what English has. Slavic languages, for example, tend not to have a pluperfect. So, the example of removing tenses is based in reality.

Hawaiian has an alphabet of just 13 letters. So, removing letters from the 26 in the English alphabet is based in reality.

The Dictionnaire de l'Académie française is being updated to its 9th edition and is expected to have ~60K words[0], whereas English dictionaries report an order of magnitude more[1] (even with the issues in the linked source, this is a large gap). Basic English[2] has a vocabulary of less than 1,000 words (if you desire a vast overhaul of the existing norms of typography, I hope that you are at least willing to entertain prior art in the area of overhauling the use of natural language as a valid example, even if you disagree with the intention or outcome). If you wanted me to go to extremes (which again, I did not in the post you replied to), I could have just suggested we use Toki Pona. Of course, if I did suggest such a conlang, you may have been correct that I was strawmanning you and going to extremes just for a point. Nevertheless, we can definitely conclude that there are, in fact, natural human languages with substantially fewer words than modern English, and there are definitely constructed and artificially restricted natural languages with enormously smaller vocabularies.

You need not agree that these examples constitute best practice, or that they represent desirable goals in the continued evolution of language and written communication. I hope, though, that you can recognize that none of these are strawmen, but based in reality, many in natural languages, and some in artificially constrained natural languages for specific purposes. If anything, I presented examples that do not represent the extremes of any position (I could easily have brought up languages with no written representation, for example). I merely selected additional examples that conform to a broad categorization of removing stuff from modern English.

I welcome further discussion on the topic, but I worry you might dismiss things I say you disagree with, as you have done once above by ascribing an intention of strawmanning you, and as you seem wont to do with typographical conventions you dislike. And if you want to eliminate the punctuation you dislike, what might you do to a person whose arguments you dismiss? (;

It seems though, that you just don’t like the various dashes, which is totally fine. Many other people and I find value in them. Still more probably just go along because, as I said, a big part of language norms comes from inertia. The point of language (other than perhaps some, but not all, artistic expression) is communication. Why abandon the norms that facilitate this communication? Is it better to stand on preference (or perhaps principle) and harm your attempts at communication or to yield to norms and be better understood (though perhaps annoyed)? I do not know that there is a correct answer to this question.

I do hope, though, that I have disabused you of the fanciful notions that I was cherry-picking ideas that are extreme just to prove a point and that I was strawmanning your argument. I have shown above numerous examples that back up each of my suggestions, grounded in the reality of natural human languages. Further, I have shown several examples that are truly extreme to show that my original suggestions were not “intentionally extreme to prove a point.”

[0] https://www.thoughtco.com/academie-francaise-1364522
[1] https://www.merriam-webster.com/help/faq-how-many-english-wo...
[2] https://simple.wikipedia.org/wiki/Basic_English

I don't care about multiple types of parenthesis per se; I do care about there being a spanning set of grammatical constructs. I don't think period and comma alone would be enough. You need constructs for compressing and abstracting. "John/Paul/Ringo/George were in the Beatles." Notice how I just made 4 sentences for the price of one. I could have written "John was in the Beatles", "Paul was in the Beatles" ... all four statements fully unrolled. You need constructs which let you FOIL sentence structure just like in math class, presenting (option A, B and C) to (you, and everyone else). You also need a handful of "client server type" interaction structures. Header information. A thing to indicate if the content is a question, request, demand, greeting etc. Grammar is not about encoding literal speech pausing, it's about encoding how to deserialize the linear sequence of words.
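The "FOIL" idea can be sketched directly as a cross product. A minimal illustration (the sentence fragments here are my own, chosen to match the Beatles example above):

```python
from itertools import product

# A compressed sentence packs several statements into one:
# "John/Paul/Ringo/George were in the Beatles."
subjects = ["John", "Paul", "Ringo", "George"]
predicates = ["was in the Beatles"]

# Unrolling distributes every subject over every predicate,
# like FOILing (a + b)(c + d) in algebra.
unrolled = [f"{s} {p}." for s, p in product(subjects, predicates)]
for sentence in unrolled:
    print(sentence)
# Four sentences recovered from one compressed form.
```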

In theory you could just make "(" and ")" the universal sub-context-denoting symbols. You would just need a different extra symbol to clarify what a given parenthesis means. The three systems make sense: one for data-agnostic compression, like a JSON object or FOILing a math expression; one for relaying text itself as an object in the domain of discussion rather than as the thing being said (aka a "quotation"); and one for scopes that are part of the discussion per se (not quotation).

Context suffices when the parts of speech have no chance of being in the same slot: compound words and numbers. Your machine screw example was pretty rare. I think the dashes are too specialized in meaning and too hard to tell apart to justify code points in the docs and buttons on my keyboard. If need be, distinguish the various flavors of hyphen with some rule about touching the letter or having two in a row. Our symbol set is reasonable: not as succinct as Hawaiian, not as bloated as Chinese. 13 chars fit in 4 bits; 26 chars fit in 5. With great strain you can maybe find a workable set of grammatical symbols without blowing past 32 chars, but you will probably end up using a 6th bit. I'm against bloating the raw number of symbols and rules everyone has to rote-learn, not dashes in particular. If it's already in frequent use, like all the paren styles, then fine, but let's not make anything worse than it has to be.

  • > Grammar is not about encoding literal speech pausing,

    This is absolutely correct.

    > it's about encoding how to deserialize the linear sequence of words

    This is absolutely incorrect. Grammar is the collection of rules that prescribes how words may be combined into valid sequences in a language. Specifically, grammar is distinct from semantics, which is concerned with meaning. A nonsense statement may be grammatically correct.

    Punctuation is the collection of non-character glyphs that are used to capture the nuances of spoken language into a written form.

    Punctuation is orthogonal to grammar.

    Put more briefly: spoken language has grammar and no punctuation; written language has the same grammar as the same spoken language and also punctuation.

    Parenthetical asides are represented in spoken language with some combination of marker words, pauses, tone of voice, word choice, and perhaps other indicators I may have forgotten. The purpose of punctuation is to lend some of the nuance of spoken communication to the otherwise sparse written word.

    The argument about the number of bits needed to encode glyphs is also orthogonal to the purpose or usefulness of language, writing, and communication. Computers are tools. A keyboard should justify the paucity of its glyphs, rather than the other way around. Once we get here, we are in the realm of pure opinion and preference, which I don't have much interest in pursuing.

    • > A nonsense statement may be grammatically correct.

      I'm sure, then, you already know: colorless green ideas sleep furiously.

      I see no contradiction in what you are calling incorrect. Whatever representation our brain uses for concepts and thoughts, to share that object requires us to pack it into a linear sequence of words which can then be reliably unpacked on the other side. The very nature of verbal communication forces the existence of serialization/deserialization rules. Those rules are what we call grammar. Grammar may be somewhat orthogonal to semantics (as you observe, it is possible to encode valid nonsense), but the grammar exists to encode semantics and is thus to some degree tied to it. The grammar rule of "subject verb object" doesn't only tell you how to check the validity of "colorless green ideas sleep furiously"; it tells you how to deserialize that sentence back into a hierarchy of constituents and their relations. It just so happens to unpack as an object of useless constituents and impossible relations.

      Punctuation may be orthogonal to grammar in the general case, but in this particular language they are highly coincident. Virtually all punctuation marks are grammatical particles. It doesn't have to be like this: some languages have "audible parenthesis" words; others have words for marking the end of a sentence as a question. Calling punctuation marks non-characters seems a bit artificial. Let's just call them the non-audible characters, in analogy with non-printable characters.

      The argument about bits was apparently lost in transmission. I assure you this isn't a preference-and-opinion thing. Information theory applies just as well to natural language encodings as it does to computer protocols. The basic principles of information entropy and optimal transmission encoding show up in every language: the least frequently used words are the longest. In an analysis of conversations across languages, researchers found the bit rate to be roughly constant. Some spoken languages seem very fast, but that's because the information density per word is lower. The brain's bit rate is a constant.

      Irrespective of whether we are using a computer, the size of an alphabet is measured in bits, and the bits in the alphabet determine how much you can possibly say per character. On an extreme end, Chinese has over 5000 characters. That's around 12 bits of information per character, at the low low cost of memorizing all of them. For comparison, ignoring capitalization and punctuation, English is a 5-bit alphabet, meaning the same amount of information fits into 3-letter words. The Hawaiian alphabet can cover 80% of those possibilities with just 3 letters, and the remainder with a 4th. Think about how powerful that is. Is memorizing 5000 arbitrary squiggles worth it to compress the width of words down by ~3 chars?
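      The back-of-the-envelope arithmetic above is easy to check. A quick sketch (the alphabet sizes are the approximate figures from the discussion; "bits per character" here just means log2 of the alphabet size):

      ```python
      from math import ceil, log2

      alphabets = {
          "Hawaiian": 13,   # ~13 letters
          "English": 26,    # ignoring case and punctuation
          "Chinese": 5000,  # a rough literacy-level character count
      }

      for name, size in alphabets.items():
          bits = log2(size)
          print(f"{name}: {size} symbols -> {bits:.1f} bits/char (fits in {ceil(bits)} bits)")

      # How many English letters carry as much information as one Chinese character?
      chars_needed = log2(5000) / log2(26)
      print(f"1 Chinese character ~ {chars_needed:.1f} English letters")
      ```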

      The number of bits in an alphabet also determines the minimum number of unique design elements needed to construct letters for it. 7-segment displays are a great example. As I said, our characters fit in 5 bits; that's the minimum. Now, when our letters came about, nobody knew about bits and they certainly weren't doing this on purpose, but almost every letter can be expressed on a 7-segment display. In other words, writing a letter only wastes two bits per character relative to saying it.
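      The segment arithmetic can be made explicit. A small sketch (treating each segment as one on/off bit of design space; whether every letter is actually legible on such a display is a separate question):

      ```python
      from math import ceil, log2

      segments = 7              # a 7-segment display has 7 on/off elements
      patterns = 2 ** segments  # distinct displayable patterns
      needed = ceil(log2(26))   # bits needed to distinguish 26 letters

      print(f"{patterns} patterns available, {needed} bits needed")
      print(f"wasted bits per character: {segments - needed}")
      ```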

      When you learn a new ligature in an Arabic script, you've doubled the number of letters you know. When you learn a new Chinese character, you've learned a new Chinese character. Language is a transmission medium. It's a tool. My takes here are no more preference and opinion than the allocation of the radio spectrum. There's an optimization tradeoff between the limited character choices of Hawaiian and the extreme rote memorization of Chinese. Going from 13 to 26 characters does double the learning time, but the learning time at that stage was short anyway. Going from what we currently have to perhaps 60-ish characters (6 bits) doubles it again. Maybe that's tolerable. The next step up is ~128 characters. There may be things you can do quicker with a large set of symbols, but the ROI for learning all those symbols doesn't pay off. Around 5 to 6 bits is where most writing systems settle.

      And that's why bloating the raw glyph table with letters and marks is the wrong solution.