Comment by jerf
19 hours ago
It looks like Boa has Unicode tables compiled inside of itself: https://github.com/boa-dev/boa/tree/main/core/icu_provider
Brimstone does not appear to.
That covers the vast bulk of the difference. The ICU data is about 10.7MB in the source (boa/core/icu_provider) and may grow or shrink by some amount during compilation.
I'm not saying it's all the difference, just the bulk.
There are a few reasons why svelte little executables with small library backings aren't possible anymore, and it isn't just ambient undefined "bloat". Unicode is a big one. Correct handling of Unicode involves megabytes of tables and data that have to live somewhere, whether it's a linked library, compiled in, tables on disk, whatever. If a program touches text and it needs to handle it correctly rather than just passing it through, there's a minimum size for that now.
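To make the "tables have to live somewhere" point concrete, here is a minimal sketch in Python, which ships its own copy of the Unicode Character Database in the standard `unicodedata` module. The `describe` helper and the variable names are made up for illustration; this is not how Boa or Brimstone do it, just a demonstration that even basic case handling is a table lookup.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Look up (general category, character name) in the bundled Unicode tables."""
    return unicodedata.category(ch), unicodedata.name(ch)


# Case mapping is not one-to-one: the German sharp s uppercases to two letters.
upper_sharp_s = "stra\u00dfe".upper()

# Caseless comparison needs case-folding data, not just a lower() call.
folded_equal = "STRASSE".casefold() == "stra\u00dfe".casefold()

# U+03A9 is the Greek letter, distinct from U+2126 OHM SIGN.
greek = describe("\u03a9")

print(upper_sharp_s, folded_equal, greek)
```

None of these answers can be computed from the code point values alone; each one comes out of a data table that somebody had to ship.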
Brimstone does embed Unicode tables, but a smaller set than Boa embeds: https://github.com/Hans-Halverson/brimstone/tree/master/icu.
Brimstone does try to use the minimal set of Unicode data needed for the language itself. But I imagine much of the difference with Boa is because of Boa's support for the ECMA-402 Internationalization API (https://tc39.es/ecma402/).
Yeah, the majority of the difference is from the Unicode data for Intl along with probably the timezone data for Temporal.
Is it possible to build Boa without these APIs?
Unicode is everywhere though. You'd think there'd be much greater availability of those tables and data and that people wouldn't need to bundle it in their executables.
Unfortunately operating systems don't make the raw Unicode data available (they only offer APIs to query it in various ways). Until they do, we all have to ship it separately.
For some OSes like Windows, some relevant APIs can indeed be used to reconstruct those tables. I found that this is in fact viable for character encoding tables, only requiring a small table for fixes in most cases.
Debian has a unicode-data package, so you can just depend on it.
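For anyone wondering what depending on that package would look like, here is a rough sketch of parsing one record in the semicolon-separated `UnicodeData.txt` format. The `UcdEntry` type and `parse_unicode_data_line` function are hypothetical names for illustration, and the install path mentioned in the comment is my assumption about where the package puts the file, so check it on your system. The sample record is embedded so the snippet runs without the package installed.

```python
from typing import NamedTuple, Optional


class UcdEntry(NamedTuple):
    code_point: int
    name: str
    category: str
    simple_lowercase: Optional[int]


def parse_unicode_data_line(line: str) -> UcdEntry:
    """Parse one semicolon-separated record in the UnicodeData.txt layout."""
    fields = line.rstrip("\n").split(";")
    lower = fields[13]  # simple lowercase mapping, empty if there is none
    return UcdEntry(
        code_point=int(fields[0], 16),
        name=fields[1],
        category=fields[2],
        simple_lowercase=int(lower, 16) if lower else None,
    )


# Assumption: on Debian the file would be read from something like
# /usr/share/unicode/UnicodeData.txt; a sample record is inlined here instead.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
entry = parse_unicode_data_line(SAMPLE)
print(entry)
```

The catch is that a program doing this at startup has to parse tens of thousands of records, which is part of why runtimes prefer precompiled tables.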
I just wish we could use system tables for that, instead of bloating every executable with their own outdated copy.
I have no issue with my system using an extra 10 MB for Ancient Egyptian capitalization to work correctly. Every single program including those rules is a lot more wasteful.
I was curious to see what that data consisted of, and apparently it's a lot of translations, like the names of all possible calendar formats in all possible languages, etc. This seems useless in the vast majority of use cases, including that of a JS interpreter. Looks to me like the typical output of a committee that's trying too hard to extend its domain.
Disclaimer: I never liked the Unicode specs.
Unicode is an attempt to encode the world's languages: there is not much to like or dislike about it, it only represents reality. Sure, it has a number of weird details, but if anything, they're due to the desire to simplify it (like Han unification or normalization forms).
Any language runtime wanting to provide date/time and string parsing functions needs access to the Unicode database (or something of comparable complexity and size).
Saying "I don't like Unicode" is like saying "I don't like the linguistic diversity in the world": I mean sure, OK, but it's still there and it exists.
Though note that date-time, currency, number, street etc. formatting is not "Unicode" even if provided by ICU: this is similarly defined by POSIX as "locales", and GNU libc probably has the richest collection of locales outside of ICU.
There are also many non-Unicode collation tables (think phonebook ordering that's different for each country and language): so no good sort() without those either.
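A small Python sketch of why a plain sort() falls short. The `crude_sort_key` function is a hypothetical stand-in that strips accents and folds case; it is explicitly not real collation, only an approximation to show the gap that per-language tables fill.

```python
import unicodedata


def crude_sort_key(s: str) -> str:
    """Approximate key: decompose, drop combining marks, fold case.

    This is NOT language-aware collation, just a rough stand-in.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


words = ["zebra", "\u00c4pfel", "apple", "Zebra"]

# Default sort compares code points: uppercase first, accented letters last.
by_code_point = sorted(words)

# The crude key puts the accented word next to its unaccented neighbours.
by_crude_key = sorted(words, key=crude_sort_key)

print(by_code_point)
print(by_crude_key)
```

Even the crude key is wrong for real users: Swedish sorts "ä" after "z", while German phonebook ordering treats "ä" like "ae". Getting that right is exactly what the per-locale collation tables are for.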
I am not questioning the goal of representing all the fine details of every possible language, currency, and calendar in use anywhere at any time in the universe; that's a respectable achievement. I'm discussing the process that led to a programming language interpreter needing, according to the comment I was replying to, to embed that trove of data.
Most of us are not using computers to represent subtle variants of those cultural artifacts and therefore they should be left in some specialized libraries.
Computers are symbolic machines, after all, and many times we would be just as well off using only 16 symbols and typing our code on a keyboard with just that many keys. We can't have anything but 64-bit floats in JS, but somehow we absolutely need to be able to tell the "peso lourd argentin (1970–1983)" from the "peso argentin (1881–1970)"? And to know that to display a chemical concentration in millimole per liter in German one has to write "mmol/l"?
I get it, the symbolic machines need to communicate with humans, who use natural languages written in all kinds of ways, so it's very nice to have a good way to output and input text. We wanted that way to not favor any particular culture and I can understand that. But how you get from there to the amount of arcane, specialized, minute detail in the ICU dataset is questionable.
Does that include emojis?
If someone builds, say, a Korean website and needs sort(), does the ICU monolith handle 100% of the common cases?
(Or substitute for Korean the language that has the largest amount of "stuff" in the ICU monolith.)
Yes, though it's easy to use the ICU library improperly or run into issues with normalization etc.
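As an illustration of the normalization pitfall with Korean text specifically, here is a minimal Python sketch using the standard `unicodedata` module rather than ICU itself. The same visible syllable can be stored as one precomposed code point or as three conjoining jamo, and the two forms do not compare equal until normalized.

```python
import unicodedata

# The syllable "han" as a single precomposed Hangul syllable (U+D55C).
precomposed = "\ud55c"

# The same syllable as three conjoining jamo: lead, vowel, trail.
jamo = "\u1112\u1161\u11ab"

# They render identically but differ at the code point level.
raw_equal = precomposed == jamo

# Normalizing both sides to NFC makes them comparable.
nfc_equal = unicodedata.normalize("NFC", jamo) == precomposed

# NFD goes the other way and splits the syllable into jamo.
nfd_length = len(unicodedata.normalize("NFD", precomposed))

print(raw_equal, nfc_equal, nfd_length)
```

If input arrives in mixed forms and you compare or deduplicate without normalizing first, you get mismatches that look like library bugs but aren't.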
As well-defined as Unicode is, it's surprising that no one has tried to replace ICU with a better mousetrap.
Not to say ICU isn’t a nice bit of engineering. I recall the table builds in particular having some great hacks.
POSIX systems actually have their own approach with "locales", and I believe it predates Unicode and ICU.
Unfortunately, for a long time, POSIX systems were uncommon on desktops, and most Unices do not provide a clean way to extend locales from userland (though I believe GNU libc does).