Comment by samat

1 day ago

I feel so sorry for Arabs now; I just read the paragraph about the everyday experience of trying to write mixed English-Arabic text in a mail client or any other editor.

> I have watched senior engineers, fluent in both Arabic and English, give up on writing a long email in Outlook on a Wednesday afternoon because the cursor would not behave, and switch to Arabic-only or English-only because the cognitive cost of fighting the editor exceeded the cost of monolingual phrasing. Actually I remember very well suffering this while using Facebook for the first time in my life, and I could not register; I was such a slow typist that when I reached the moment where the cursor did this weird thing, I would just stare at it and never progress.

> This is the ordinary experience of writing mixed Arabic-English text in 2026, in every major editor, email client, and chat application I know of. The pettier cousins are everywhere too, and I collect them: a range like 10–20 silently reading as twenty-to-ten, because digits are weak and the dash is neutral; a trailing exclamation mark teleporting to the far end of the line; a password, toggled visible, displaying in an order that does not match what was typed. None of these are anyone's bug, exactly.

My own Cyrillic struggles are nothing in comparison.
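
For anyone who wants to poke at the "digits are weak and the dash is neutral" part of the quote, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `bidi_classes` and the sample strings are made up for illustration; the class labels come from the Unicode Character Database that ships with the interpreter, and this only reports the per-character classes, it does not run the full bidirectional algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidi class) pairs from the Unicode Character Database."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical sample: a numeric range with an en dash and a trailing "!".
sample = "10\u201320!"
for ch, cls in bidi_classes(sample):
    # EN = European number (weak), ON = other neutral
    print(repr(ch), cls)

# For contrast: an Arabic letter is strongly right-to-left (class AL).
print(bidi_classes("\u0627"))
```

Because neither the digits nor the dash carry a strong direction of their own, their visual order ends up depending on the surrounding text, which is consistent with the range flipping described in the quote.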

9 comments


dhosek 16 hours ago

The complexities of mixed LR and RL text are quite astonishing. It's not even a case of just switching modes when switching scripts, since double-nested (or deeper) texts can change the semantics of line breaks. This article provides a good overview: https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf [1]

In college [2], when I wanted to quote some texts from Exodus in Hebrew in a paper that I wrote, I ended up avoiding the issue by hand-reversing the letter order and manually breaking lines. 8 bits is insufficient to cover all the possible combinations of letters and vowel markings so the font didn’t include any vowel markings and only did dageshim for בּ and פּ if I recall correctly.

⸻

1. As an aside, it would have been really nice if Unicode provided an R-L mirrored Latin alphabet to make it easier for monolingual developers to grasp the complexities surrounding mixed directional typesetting. I suppose it could still be added, although Unicode tends towards conservatism on adding characters.

2. This was 1990, well before Unicode, in the era of a hundred or so 8-bit character encodings, most of which were not widely implemented. I also had to type the text using the arbitrary ASCII-Hebrew mapping of the font I was using which, among other things, led me to discover that letter frequency in Hebrew is much more uniform than it is in English.

teddyh 6 hours ago

RFC 3986 (STD 66) recommends (in appendix C) delimiting URLs in angle brackets to avoid the problem which your link now has. I.e. if you’d written <https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf>¹ there would have been no problem.

kstrauser 13 hours ago

That link’s a 404.

- gus_massa 8 hours ago
  
  The link says
  https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf ¹
  There is no space between pdf and ¹, so the HN server assumes incorrectly that the ¹ is part of the link.
- mschuster91 8 hours ago
  
  Strip the superscript-1 character at the end. I'm surprised HN's link formatter regex detects it as part of the link: https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf
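
A rough sketch of the failure mode discussed in this sub-thread, and of the angle-bracket delimiting from RFC 3986 appendix C. HN's actual link formatter is not shown anywhere here, so `NAIVE_URL` is only an assumed "match until whitespace" pattern, and `BRACKETED_URL` is likewise a hypothetical illustration rather than a full URI parser.

```python
import re

# Assumption: a naive linkifier that takes everything up to the next whitespace.
NAIVE_URL = re.compile(r"https?://\S+")

# Angle-bracket delimiting: only what sits between "<" and ">" is the URL.
BRACKETED_URL = re.compile(r"<(https?://[^<>\s]+)>")

# "\u00b9" is the superscript-one footnote marker, with no space before it.
text_plain = "overview: https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf\u00b9"
text_bracketed = "overview: <https://tug.org/TUGboat/tb08-1/tb17knutmix.pdf>\u00b9"

naive = NAIVE_URL.search(text_plain).group(0)
bracketed = BRACKETED_URL.search(text_bracketed).group(1)

print(naive.endswith("\u00b9"))  # the footnote marker was swallowed into the link
print(bracketed)                 # the URL without the marker
```

The superscript is not whitespace, so a pattern like the naive one has no reason to stop before it. The brackets give the matcher an explicit end of the URL.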

qingcharles 14 hours ago

I was lucky enough to date someone from the Gulf, so I forced myself to learn to read Arabic script (without understanding the text), which is a huge bonus when creating multilingual designs.

I already had a good understanding of CJK scripts, and you'll come across RtL there, with things like tategaki, which is both vertical and RtL at the same time (and can include quotes in other languages such as English and Arabic). Here are some lyrics I made in that format for reference:

https://codepen.io/kingcharlesone/pen/GgRXLoM

What peculiarities does Cyrillic text have? I've never learned to convert Cyrillic to Latin.

raphlinus 8 hours ago

I love this question. Basically all scripts have things that make them challenging to render, sometimes little things, sometimes bigger ones.

Cyrillic for Russian is reasonably straightforward, but it's also used for many other languages. The variation in style is particularly notable for Bulgarian[1]. A sophisticated font might have a "locl" feature with locale-specific adjustments, but this is not universal yet; for example, the issue to add it to Open Sans is still open[2]. To see the differences, try [3] and use the Language dropdown to select Bulgarian.

[1]: https://en.wikipedia.org/wiki/Bulgarian_alphabet
[2]: https://github.com/googlefonts/opensans/issues/114
[3]: https://localfonts.eu/freefonts/traditional-cyrillic-free-fo...

cyberrock 13 hours ago

I recently learned how messed up URLs are in RtL and I'm also eternally grateful to not have to deal with that. Simulated example:

    https://example.com/[Arabic or Hebrew start of sentence]
                                            wrapped_url_part

It seems like there are probably some phishing attacks based on this.
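
As a hedged illustration of how one might flag such URLs for a closer look: the sketch below checks whether a URL contains any right-to-left characters or explicit bidi control characters, using the standard `unicodedata` module. The function name `has_bidi_risk` and the example URLs are invented for this sketch; it is a crude "may display in a surprising order" flag, not a phishing detector.

```python
import unicodedata

# Bidi classes for right-to-left letters and Arabic numbers.
RTL_CLASSES = {"R", "AL", "AN"}
# Explicit embedding, override, and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def has_bidi_risk(url):
    """True if any character could reorder the URL when it is displayed."""
    classes = {unicodedata.bidirectional(ch) for ch in url}
    return bool(classes & (RTL_CLASSES | BIDI_CONTROLS))

print(has_bidi_risk("https://example.com/path"))                  # plain ASCII path
print(has_bidi_risk("https://example.com/\u05e9\u05dc\u05d5\u05dd"))  # Hebrew path segment
print(has_bidi_risk("https://example.com/\u202efdp.exe"))          # RIGHT-TO-LEFT OVERRIDE
```

A legitimate Arabic or Hebrew URL trips this check too, so at most it tells a UI when to show the URL in a less ambiguous form.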

khoirul 1 day ago

Monday's editor does this even for English text!