Comment by cyphar
14 hours ago
I went through a Japanese ePUB novel I happened to have on hand (the Japanese translation of 1984) and 65% of its UTF-8 bytes are ASCII. So in this case UTF-16 would take something like 53% more bytes (napkin math, assuming the remaining bytes are 3-byte sequences that become 2 bytes each in UTF-16).
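For anyone who wants to check the napkin math, here is a rough sketch. It assumes every non-ASCII character is a 3-byte UTF-8 sequence inside the Basic Multilingual Plane (true for kana and most kanji), so it takes 2 bytes in UTF-16. The function name `utf16_growth` is made up for illustration.

```python
def utf16_growth(ascii_fraction: float, utf8_bytes_per_non_ascii: int = 3) -> float:
    """Estimate UTF-16 size relative to UTF-8 size (1.0 means same size).

    ascii_fraction is the share of UTF-8 *bytes* that are ASCII.
    Assumes every non-ASCII character is in the BMP (2 bytes in UTF-16).
    """
    non_ascii_fraction = 1.0 - ascii_fraction
    # Each ASCII byte is one character, which costs 2 bytes in UTF-16.
    ascii_utf16 = ascii_fraction * 2
    # Non-ASCII bytes group into characters of 3 UTF-8 bytes, 2 UTF-16 bytes each.
    non_ascii_utf16 = non_ascii_fraction / utf8_bytes_per_non_ascii * 2
    return ascii_utf16 + non_ascii_utf16


ratio = utf16_growth(0.65)
print(f"{(ratio - 1) * 100:.0f}% more bytes")  # prints "53% more bytes"
```

With no ASCII at all, the same estimate gives 2/3, which is the best case for UTF-16 under this assumption.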
You could argue that because the file will be compressed (and UTF-16 wastes a whole NUL byte on every ASCII character), the compressed UTF-16 version would come out ahead on total file size, precisely because there are so many wasted bytes. But there are plenty of cases where files aren't compressed, and most systems don't have compressed memory, so you will pay the cost somewhere.
But in the interest of transparency, a very crude test on the same ePUB yields a 10% smaller compressed file with UTF-16. I think a 10% size penalty (in a very favourable scenario for UTF-16) in exchange for all of the benefits of UTF-8 is a more than acceptable tradeoff, and the incredibly wide proliferation of UTF-8 implies most people agree.
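A crude comparison along these lines could look like the sketch below. This is not the exact test described above, just one way to run it: it uses `zlib` (DEFLATE, the same family of compression ePUB's ZIP container uses) on a string you supply, and `compare_encodings` is a hypothetical helper name. Results will vary a lot with the text and the compressor settings.

```python
import zlib


def compare_encodings(text: str) -> dict:
    """Return raw and zlib-compressed sizes of text in UTF-8 and UTF-16."""
    sizes = {}
    # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends.
    for name in ("utf-8", "utf-16-le"):
        raw = text.encode(name)
        sizes[name] = {
            "raw": len(raw),
            "compressed": len(zlib.compress(raw, 9)),
        }
    return sizes


# Tiny made-up sample: markup-like ASCII mixed with Japanese text.
sample = "<p>こんにちは</p>" * 100
for name, size in compare_encodings(sample).items():
    print(name, size["raw"], size["compressed"])
```

To try it on a real book, read the extracted XHTML files as text and pass their contents in instead of `sample`.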