Comment by BruceEel

4 years ago

Walter, D has conditional compilation, versioning and CTFE without a preprocessor, so I guess that covers the 99% "sane" functionality. Where do you draw the line between that and the 1% abomination part, i.e. your thoughts on, say, compile time type introspection and things like generating ('printing') types/declarations?

The abomination is using the preprocessor to redefine the syntax and/or invent new syntax. Supporting identifier characters that look like `:` is just madness.

Of course, I've also opined that Unicode supporting multiple encodings for the same glyph is also madness. The Unicode people veered off the tracks and sank into a swamp when they decided that semantic information should be encoded into Unicode characters.

  • What other kind of difference should be encoded into Unicode characters? For example, the glyphs for the Latin a and the Cyrillic а, or the Latin i and the Cyrillic (Ukrainian, Belarusian, and pre-1918 Russian) і look identical in practically every situation, and the Latin (Turkish) ı and the Greek ι aren’t far off. At least not far off compared to the Cyrillic (most languages) д and the Cyrillic (Southern) g-like version (from the standard Cyrillic cursive), or the Cyrillic т and the several Cyrillic (Southern) versions that are like either an m or a turned m (from the cursive, again). Yet most people who are acquainted with the relevant languages would say the former are different “letters” (whatever that means) and the latter are the same.

    [Purely-Latin borderline cases: umlaut (is not two dots in Fraktur) vs diaeresis (languages that use it are not written in Fraktur), acute (non-Polish, points past the letter) vs kreska (Polish, points at the letter). On the other hand, the mathematical “element of” sign was still occasionally typeset as an epsilon well into the 1960s.]

    Unicode decides most of these based on the requirement to roundtrip legacy encodings (“have these ever been encoded differently in the same encoding?”), which seems reasonable, yet results in homograph problems and at the same time the Turkish case conversion botch. In any case, once (sane) legacy encodings run out but you still want to be consistent, what do you base the encoding decisions on but semantics? (On the other hand, once you start encoding semantic differences, where do you stop?..) You could do some sort of glyph-equivalence-class thing, but that would still give you no way to avoid unifying a and а: everyone who writes both writes them the same.

    None of this touches on Unicode “canonical equivalence”, but your claim (“Unicode supporting multiple encodings for the same glyph is [...] madness”) covers more than just that if I understood it correctly. And while I am attacking it in a sense, it’s only because I genuinely don’t see how this part could have been done differently in a major way.
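For anyone who wants to poke at the two problems mentioned above (homographs and the Turkish case conversion), here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration; the behavior shown is that of Python's locale-independent `str.upper` and `str.lower`, not of any Turkish-aware library.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a, usually drawn identically

# Same glyph in most fonts, different code points, so comparison fails.
print(describe(latin_a))
print(describe(cyrillic_a))
print(latin_a == cyrillic_a)  # False

# Normalization does not unify them either: they are not canonically
# or compatibility equivalent.
print(unicodedata.normalize("NFKC", cyrillic_a) == latin_a)  # False

# Turkish case conversion: dotless i uppercases to plain I, which then
# lowercases to dotted i, so the round trip loses the original letter.
dotless_i = "\u0131"
round_trip = dotless_i.upper().lower()
print(describe(round_trip))  # U+0069 LATIN SMALL LETTER I

# Capital I with dot above lowercases to two code points.
print([describe(c) for c in "\u0130".lower()])
```

Nothing here distinguishes the two letters by appearance; the only difference the program can see is the code point, which is exactly the property the homograph trick relies on.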

    • It's a good question. The answer is straightforward. Let's say you saw `i` in a book. How would you know if it is Latin or Cyrillic?

      By the context!

      How would a book distinguish `a` as in `apple` from `a` as in `a+b`? (Unicode has a letter a that is separate from a math a.)

      By the context!

      This is what I meant when I said Unicode has no business adding semantic content. Semantics come from context, not from the glyph. After all, what if I decided to write:

      (a) first bullet point

      (b) second bullet point

      Now what? Is that letter a or math symbol a? There's no end to semantic content. It's impossible to put this into Unicode in any kind of reasonable manner. Trying to do it leads one into a swamp of hopelessness.

      BTW, the attached article is precisely about deliberately misusing identical glyphs in order to confuse the reader because the C compiler treats them differently. What better case for semantic content for glyphs being a hopelessly wrongheaded idea?
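To make the "letter a vs. math a" point concrete, here is a small Python sketch, again using only the standard `unicodedata` module. The helper name `fold` is hypothetical. It shows that Unicode does carry a separate mathematical italic a, and, as it happens, a separate parenthesized a as well, and that NFKC compatibility normalization folds both back to plain ASCII.

```python
import unicodedata

math_a = "\U0001D44E"  # the "math a" from the Mathematical Alphanumeric Symbols block
paren_a = "\u249C"     # a single code point that renders as (a)

def fold(text: str) -> str:
    """Apply NFKC, which replaces compatibility characters with plain ones."""
    return unicodedata.normalize("NFKC", text)

print(unicodedata.name(math_a))   # MATHEMATICAL ITALIC SMALL A
print(unicodedata.name(paren_a))  # PARENTHESIZED LATIN SMALL LETTER A

print(math_a == "a")        # False: different code points
print(fold(math_a) == "a")  # True: NFKC discards the "math" distinction
print(fold(paren_a))        # (a)
```

Whether the original text meant a letter, a variable, or a list marker is not recoverable from the folded string, which is the "context, not glyph" argument in code form.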

    • I'm obviously not Walter, but I have a succinct answer that may upset a few people, but avoids a lot of confusion at the same time.

      The idea of a letter in an alphabet and a printable glyph for that letter are two different ideas. Unicode could have and probably should have had a two-layer encoding where the letters are all different but an extra step resolves letters to glyphs. Where one glyph can represent more than one letter, a modifier can be attached to represent the parent alphabet so no semantic information is lost. Comparison for "same character" would be at the glyph level without modifiers, and we could have avoided a bunch of different Unicode equivalence testing libraries that have to be individually written, maintained, and debugged. Use in something like a spell checker, conversion to other character sets, or stylization like cursive could have used the glyph and source-language modifier both.
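A rough Python sketch of the two-layer idea described above, purely as a thought experiment. Everything in it is hypothetical: the `Letter` type, the `glyph` and `script` fields, and the `same_glyphs` helper are invented for illustration, and the glyph identifiers are just ASCII strings standing in for whatever a real glyph registry would use. This is not how Unicode works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Letter:
    glyph: str   # hypothetical shared glyph identity, e.g. "a"
    script: str  # modifier naming the parent alphabet, e.g. "Latin"

def same_glyphs(x: list[Letter], y: list[Letter]) -> bool:
    """Compare two texts at the glyph layer, ignoring script modifiers."""
    return [l.glyph for l in x] == [l.glyph for l in y]

# The same three shapes, once as Latin letters and once as Cyrillic ones.
latin_word = [Letter(g, "Latin") for g in "cop"]
cyrillic_word = [Letter(g, "Cyrillic") for g in "cop"]

print(same_glyphs(latin_word, cyrillic_word))  # True: they look identical
print(latin_word == cyrillic_word)             # False: the letters differ
```

Under this scheme a "does this look the same?" check is a plain comparison on one field, while a spell checker or transliterator would still have the script modifier available.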

    • Ignoring Unicode and focusing just on C: if the glyph matches a glyph used in any existing C operator maybe it shouldn't be legal as an identifier character.

  • That ship sailed long before Unicode. Even ASCII has characters with multiple valid glyphs (lower case a can lose the ascender, and lower case g is similarly variable in the number of loops), not to mention multiple characters that are often represented with the same glyph (lower case l, upper case I, digit 1).

    • That's a font issue with some fonts, not a green light for blessing multiple code points with the exact same glyph.

      In fact, having a font that makes l I and 1 indistinguishable is plenty of good reason to NOT make this a requirement.

  • > The Unicode people veered off the tracks and sank into a swamp when they decided that semantic information should be encoded into Unicode characters.

    As if that weren't enough, they also decided to cram half-assed formatting into it. You got bold letters, italics, various fancy-style letters, superscripts and subscripts for this and that, all for the sake of legacy compatibility. Unicode was legacy right from the beginning.

    • The "fonts" in Unicode are meant to be for math and scientific symbols, and not a stylistic choice. Don't use them for text, as it can be a cacophony in screen readers.

      Unicode chose to support lossless conversion to and from other encodings it replaces (I presume it was important for adoption), so unfortunately it inherited the sum of everyone else's tech debt.

    • Unicode did worse than that. They added code points to esrever the direction of text rendering. Naturally, this turned out to be useful for injecting malware into source code, because having the text rendered backwards and forwards erases the display of the malware, so people can't see it.

      Note that nobody needs these code points to reverse text. I did it above without gnisu those code points.
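As a practical footnote to the bidirectional-control point, here is a minimal Python sketch of a scanner that flags those code points in source text. The function name `find_bidi_controls` and the sample string are made up for illustration; the detection relies on the bidirectional classes reported by the standard `unicodedata` module (embeddings, overrides, isolates, and their terminators).

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters:
# embeddings (LRE, RLE), overrides (LRO, RLO), isolates (LRI, RLI, FSI),
# and their terminators (PDF, PDI).
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each bidi control, both 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((line_no, col_no, unicodedata.name(ch)))
    return hits

# Hypothetical sample: a right-to-left override hidden on the first line.
sample = "ab\u202Ecd\nef"
for hit in find_bidi_controls(sample):
    print(hit)  # (1, 3, 'RIGHT-TO-LEFT OVERRIDE')

# Reversing text needs no special code points at all.
print("reverse"[::-1])  # esrever
```

A check like this could run in a pre-commit hook or CI step; whether to reject such characters outright or only outside string literals and comments is a policy choice this sketch does not make.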