Comment by mort96
16 hours ago
I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets with 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.
In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...
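A rough Python sketch of what this means in practice, using the standard-library `quopri` module. The sample string and variable names are only for illustration:

```python
import quopri

text = "café"
raw = text.encode("utf-8")  # four characters, five bytes

# ASCII characters keep the 8th bit clear; every byte of a multi-byte
# UTF-8 sequence has it set.
high_bit = [bool(b & 0x80) for b in raw]

# A channel that is not 8-bit clean and zeroes the top bit mangles the text.
stripped = bytes(b & 0x7F for b in raw)

# Quoted-printable rewrites non-ASCII octets as =XX, so only 7-bit
# bytes travel over the wire and the original octets can be recovered.
encoded = quopri.encodestring(raw)
decoded = quopri.decodestring(encoded)

print(high_bit)
print(stripped)
print(encoded, decoded == raw)
```

The stripped version is the kind of corruption you would get from a 7-bit channel, and the quoted-printable version is the workaround e-mail uses to avoid it.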
"Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.
5 characters?
I thought it was normally six 6-bit characters?
The relevant Wikipedia page (https://en.wikipedia.org/wiki/36-bit_computing) indicates that 6x6 was the most common, but that 5x7 was sometimes used as well.
... However, I'm not sure how much I trust it. It says that 5x7 was "the usual PDP-6/10 convention" and was called "five-seven ASCII", but I can't find the phrase "five-seven ASCII" anywhere on Google except in posts quoting that Wikipedia page. It cites two references, neither of which contains the phrase "five-seven ASCII".
Though one of the references (RFC 114, for FTP) corroborates that PDP-10 could use 5x7:
To me, it seems like 5x7 was one of several conventions for storing character data on a PDP-10 (and probably other 36-bit machines), and Wikipedia hallucinated that the name for this convention is "five-seven ASCII". (For niche topics like this, I sometimes see authors state their own personal terminology as fact; be sure to check sources!)
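For anyone who wants to see the arithmetic: 5 x 7 = 35 bits, so one bit of the 36-bit word is left over. Here is a minimal Python sketch of that packing. The function names are made up, and the layout (characters left-justified, lowest bit spare, NUL padding) is my assumption about the convention, not something taken from the references above:

```python
WORD_BITS = 36

def pack_5x7(text: str) -> int:
    """Pack up to five 7-bit ASCII characters into one 36-bit word."""
    codes = text.encode("ascii")      # raises on anything outside 7-bit ASCII
    if len(codes) > 5:
        raise ValueError("at most five characters fit in one word")
    codes = codes.ljust(5, b"\0")     # pad short strings with NUL (assumed)
    word = 0
    for c in codes:
        word = (word << 7) | c
    return word << 1                  # leave the low-order bit spare (assumed)

def unpack_5x7(word: int) -> str:
    """Recover the characters from a word built by pack_5x7."""
    word >>= 1                        # drop the spare bit
    chars = bytes((word >> shift) & 0x7F for shift in range(28, -1, -7))
    return chars.rstrip(b"\0").decode("ascii")

word = pack_5x7("Hello")
print(f"{word:012o}", unpack_5x7(word))
```

Note that mixed case survives the round trip, which is the whole point of spending 7 bits per character instead of 6.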
That was true at the system level on ITS: file and command names were all 6-bit. But six bits doesn't leave space for important code points (like lowercase) needed for text processing. More practical stuff on PDP-6/10 and pre-360 IBM played other tricks.
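To make the lowercase problem concrete, here is a small Python sketch of six 6-bit characters per 36-bit word. It assumes the DEC-style SIXBIT mapping (ASCII code minus 32, covering space through underscore) and space padding; the function names are hypothetical:

```python
def pack_sixbit(name: str) -> int:
    """Pack up to six characters into a 36-bit word, 6 bits each."""
    name = name.upper()               # no lowercase in a 64-character set
    if len(name) > 6:
        raise ValueError("at most six characters fit in one word")
    word = 0
    for ch in name.ljust(6):          # pad with spaces, which encode as 0
        code = ord(ch) - 32           # assumed mapping: ASCII 32..95 -> 0..63
        if not 0 <= code < 64:
            raise ValueError(f"{ch!r} has no 6-bit code")
        word = (word << 6) | code
    return word

def unpack_sixbit(word: int) -> str:
    """Recover the characters from a word built by pack_sixbit."""
    chars = (chr(((word >> shift) & 0x3F) + 32) for shift in range(30, -1, -6))
    return "".join(chars).rstrip()

print(unpack_sixbit(pack_sixbit("login")))
```

The name comes back as upper case, since the case information is lost at packing time. That is fine for file and command names but not for text processing.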