Comment by mananaysiempre

4 years ago

(I expect Walter probably has better things to do than to reply to random guys on the ’net, but we can always hope, and I was curious :) )

First off, Unicode cursive (bold, Fraktur, monospace, etc.) Latin letters are not meant to be styles, they are mathematical symbols. Of course, that doesn’t mean people aren’t going to use them for that[1], and I’m not convinced Unicode should have gotten into that particular can of worms, but I think you can consistently say that the difference between, for example, an italic X for the length of a vector and a bold X for the vector itself (as you could encounter in a mechanics text) is not (just) one of style. Similarly for the superscripts and modifier letters—a [ph] and a [pʰ] or a [kj] and a [kʲ] in an IPA transcription (for which the modifiers are intended) denote very different sounds (granted, ones that are unlikely to be used at the same time by a single speaker in a single language, but IPA is meant to be more general than that).
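To make the "symbols, not styles" point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `fold_compat` is my own, and NFKC is just one convenient way to show what a "treat them as styles" folding would do; I'm not claiming it's what anyone in this thread proposed.

```python
import unicodedata

# Look the characters up by Unicode name rather than hard-coding code points.
bold_x = unicodedata.lookup("MATHEMATICAL BOLD CAPITAL X")
italic_x = unicodedata.lookup("MATHEMATICAL ITALIC CAPITAL X")
modifier_h = unicodedata.lookup("MODIFIER LETTER SMALL H")


def fold_compat(text: str) -> str:
    """Hypothetical helper: apply NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


# As encoded, the bold and italic X are distinct characters...
distinct_as_encoded = bold_x != italic_x

# ...but compatibility folding collapses both to a plain Latin X,
# which loses the vector-vs-length distinction from the mechanics example.
both_fold_to_x = fold_compat(bold_x) == fold_compat(italic_x) == "X"

# Same story for IPA: aspirated [pʰ] folds into the cluster [ph].
aspirated = "p" + modifier_h
plain_cluster = "ph"
ipa_distinction_lost = fold_compat(aspirated) == plain_cluster
```

So the distinction survives only as long as nothing in the pipeline applies compatibility normalization.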

(Or wait, was this a reply to my point about Russian vs Bulgarian d? The Bulgarian one is not a cursive variant: it’s derived from a cursive one but is a perfectly normal upright letter in both serif and sans-serif. It looks exactly like a Latin “single-storey” g, as in most sans-serif fonts, and never like a Latin “double-storey” g, as in most serif fonts. Barring font problems, printed Bulgarian only uses that form, while printed Russian never does. I guess you could declare all of those to be variants of one another, even if it’s wrong etymologically, but even to a Cyrillic user who has never been to Bulgaria that would be quite baffling.)

As to your actual point, I don’t think the comparison you describe could be made language-independent enough that you wouldn’t still end up needing to use a language-specific collation equivalence at the same time (which seems to be your implication IIUC). E.g. a French speaker would usually want oe and œ to compare the same but different from o-diaeresis, but a German speaker might (or might not) want oe and o-umlaut to compare the same, while every font renders o-diaeresis and o-umlaut exactly the same. French speakers (but possibly not in every country?) will almost always drop diacritics over capital letters, and Russian speakers frequently turn ё (/jo/, /o/) into е (/je/, /e/) except in a small set of words where there’s a possibility of confusion (the surnames Chebyshev and Gorbachev, which end in -ёв /-of/, are well-known victims of this confusion). Å is a stylistic variant of aa in Norwegian, but a speaker of Finnish (which doesn’t use å) would probably be surprised if forced to treat them the same.
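A toy Python sketch of why the equivalence ends up keyed by language. The `FOLDS` table and `search_key` function are hypothetical and deliberately tiny; they only encode the examples above. Real collation comes from much larger, tailored rule sets and is not implemented here.

```python
import unicodedata

# HYPOTHETICAL per-language folding tables covering only the examples above.
FOLDS = {
    "fr": {"\u0153": "oe"},      # œ compares like oe
    "de": {"\u00f6": "oe"},      # ö may compare like oe
    "ru": {"\u0451": "\u0435"},  # ё is routinely written as е
    "nb": {"\u00e5": "aa"},      # å as a variant of aa
    "fi": {},                    # no such equivalences
}


def search_key(text: str, lang: str) -> str:
    """Hypothetical helper: lowercase, NFC-normalize, apply one language's folds."""
    key = unicodedata.normalize("NFC", text).lower()
    for src, dst in FOLDS.get(lang, {}).items():
        key = key.replace(src, dst)
    return key


# The same code point (U+00F6) needs different treatment per language:
german_key = search_key("sch\u00f6n", "de")   # folds to "schoen"
french_key = search_key("sch\u00f6n", "fr")   # left alone
```

Note that the input `ö` is byte-for-byte identical in both calls; only the language tag changes the result, which is the part a glyph-level scheme cannot supply.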

And that’s only in Europe. What about Arabic, where positional variants can make (what speakers think of as) a single letter look very different? Even in Europe, should σ and ς be “the same glyph”? They certainly have the same phonetic value, and you always have to use one or the other...
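For what it's worth, here is how this looks from Python's standard library. This is a sketch of existing behaviour, not a proposal: the two sigmas are separate code points that case folding identifies, and the legacy Arabic presentation forms fold to the base letter under NFKC. The variable names are mine.

```python
import unicodedata

final_sigma = "\u03c2"    # ς
medial_sigma = "\u03c3"   # σ

# Distinct code points, but case folding treats them as the same letter.
same_when_folded = final_sigma.casefold() == medial_sigma.casefold()

# Choosing between them is contextual: lowercasing ΟΔΟΣ picks the final form.
lowered = "\u039f\u0394\u039f\u03a3".lower()

# Arabic beh: one letter, four legacy "presentation form" code points.
beh = unicodedata.lookup("ARABIC LETTER BEH")
forms = [
    unicodedata.lookup(f"ARABIC LETTER BEH {position} FORM")
    for position in ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")
]

# NFKC maps every positional form back to the single base letter.
folded_forms = {unicodedata.normalize("NFKC", form) for form in forms}
```

In ordinary Arabic text only the base letter is stored and the positional shape is picked at display time, whereas Greek text stores σ and ς explicitly.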

Of course, we already have a (font-dependent) codepoint-to-glyph translation in the guise of OpenType shaping, but it’s not particularly useful for anything but display (and even there it’s non-ideal).

[1] https://utcc.utoronto.ca/~cks/space/blog/tech/PeopleAlwaysEx...

2 comments

pvg 4 years ago

printed Bulgarian only uses that form

This is a total pedantitangent but I don't think that's actually true. These Wikipedia pages don't talk about it directly, but I think they give a bit of the flavour and related info that suggests it's not nearly that set in stone:

https://bg.wikipedia.org/wiki/%D0%91%D1%8A%D0%BB%D0%B3%D0%B0...

https://bg.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B6%D0%B4...

The second one, in particular, says early versions of Peter I's Civil Script had the g-looking small д, so these variants have been used concurrently for some time.

cestith 4 years ago

I made no mention of collation, alternate compositions, or of fonts. All I'm saying is that Unicode, from the beginning, could have made capital alpha and capital Latin 'A' the same glyph: a glyph-part representation plus a separate letter-part representation could have made clear which was which. O-with-umlaut and o-with-diaeresis could have been done the same way. Since you've mentioned fonts, I'll carry on through that topic. Rather than having two code points with two different entries in every font, we could have considered the glyph and the parent alphabet as two pieces of data and had one entry in the font for the glyph.
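A rough Python sketch of that two-part idea, purely as an illustration. The `Letter` type, the `FONT` table, and the helper names are all hypothetical; nothing like this exists in Unicode or in any font format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Letter:
    """Hypothetical two-part character: parent alphabet plus shared glyph."""
    script: str  # letter part: which alphabet the character belongs to
    glyph: str   # glyph part: identifier of the shared shape


LATIN_A = Letter(script="Latin", glyph="A")
GREEK_ALPHA = Letter(script="Greek", glyph="A")

# One font entry per glyph, however many alphabets share it.
FONT = {"A": "<outline data for A>"}


def same_glyph(a: Letter, b: Letter) -> bool:
    # Compare on shape only, ignoring the parent alphabet.
    return a.glyph == b.glyph


def same_letter(a: Letter, b: Letter) -> bool:
    # Compare on both parts: alphabet and shape must match.
    return a == b


def font_entry(letter: Letter) -> str:
    # Rendering needs only the glyph part.
    return FONT[letter.glyph]
```

Here `same_glyph(LATIN_A, GREEK_ALPHA)` holds while `same_letter(LATIN_A, GREEK_ALPHA)` does not, and both letters resolve to the single `FONT["A"]` entry.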