
Comment by dhosek

18 hours ago

The complexities of mixed LR and RL text are quite astonishing. It’s not simply a case of switching modes when switching scripts, since doubly nested (or deeper) text can change the semantics of line breaks. This article provides a good overview: https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf [1]
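To make the "just switch modes when the script changes" idea concrete, here is a minimal Python sketch of that naive approach. It uses the standard library's `unicodedata.bidirectional` to look up each character's bidi class and groups adjacent characters into runs. The function names `naive_direction` and `naive_runs` are made up for illustration, and this is deliberately not the Unicode Bidirectional Algorithm: it ignores nesting, embedding levels, and line breaking, which is where the real difficulty lies.

```python
import unicodedata
from itertools import groupby


def naive_direction(ch):
    """Classify a character as 'L', 'R', or 'N' from its Unicode bidi class.

    Hypothetical helper: 'N' lumps together every weak or neutral class
    (digits, spaces, punctuation), which a real implementation must resolve.
    """
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "L"
    if bidi in ("R", "AL"):  # Hebrew is 'R', Arabic letters are 'AL'
        return "R"
    return "N"


def naive_runs(text):
    """Split text into maximal runs of characters sharing a naive direction."""
    return [(d, "".join(chars)) for d, chars in groupby(text, key=naive_direction)]


if __name__ == "__main__":
    # Latin letters, then alef-bet-gimel (written as escapes), then digits.
    sample = "abc \u05d0\u05d1\u05d2 123"
    for direction, run in naive_runs(sample):
        print(direction, repr(run))
```

Even on this flat sample, the spaces and digits come out as "N" runs whose direction depends on their surroundings, and nothing here says what to do when a run is split across two lines.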

In college [2], when I wanted to quote some texts from Exodus in Hebrew in a paper that I wrote, I ended up avoiding the issue by hand-reversing the letter order and manually breaking lines. 8 bits is insufficient to cover all the possible combinations of letters and vowel markings, so the font didn’t include any vowel markings and, if I recall correctly, only provided dageshim for בּ and פּ.

1. As an aside, it would have been really nice if Unicode provided an R-L mirrored Latin alphabet to make it easier for monolingual developers to grasp the complexities surrounding mixed directional typesetting. I suppose it could still be added, although Unicode tends towards conservatism on adding characters.

2. This was 1990, well before Unicode, in the era of a hundred or so 8-bit character encodings, most of which were not widely implemented. I also had to type the text using the arbitrary ASCII-Hebrew mapping of the font I was using, which, among other things, led me to discover that letter frequency in Hebrew is much more uniform than it is in English.