
Comment by mort96

16 hours ago

I don't know if this is the reason or if the causality goes the other way, but it's worth noting that we didn't always have 8 general-purpose bits. 7 data bits plus 1 parity bit, flag bit, or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode 8-bit octets using only 7-bit characters). A communication channel that can transmit all 8 bits in a byte unchanged is called 8-bit clean [2], and that wasn't always a given.

In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...

[1] https://en.wikipedia.org/wiki/Quoted-printable

[2] https://en.wikipedia.org/wiki/8-bit_clean
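
To make the 7-bit point concrete, here is a minimal sketch using Python's standard `quopri` module. The sample string and the `is_7bit` helper are my own illustration, not anything from the linked pages. It shows that UTF-8 puts non-ASCII characters into bytes with the 8th bit set, and that quoted-printable rewrites those bytes so the result survives a channel that is not 8-bit clean.

```python
import quopri

def is_7bit(data: bytes) -> bool:
    """Hypothetical helper: True if no byte has its 8th (high) bit set."""
    return all(b < 0x80 for b in data)

text = "naïve café"                     # illustrative sample, not from the thread
octets = text.encode("utf-8")           # non-ASCII characters become bytes >= 0x80
encoded = quopri.encodestring(octets)   # each such byte is written as "=XX"

print(is_7bit(octets))    # the raw UTF-8 needs an 8-bit clean channel
print(is_7bit(encoded))   # the quoted-printable form fits in 7 bits
print(encoded)

# Decoding restores the original octets exactly.
assert quopri.decodestring(encoded) == octets
```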

"Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.

  • 5 characters?

    I thought it was normally six 6-bit characters?

    • The relevant Wikipedia page (https://en.wikipedia.org/wiki/36-bit_computing) indicates that 6x6 was the most common, but that 5x7 was sometimes used as well.

      ... However, I'm not sure how much I trust it. It says that 5x7 was "the usual PDP-6/10 convention" and was called "five-seven ASCII", but I can't find the phrase "five-seven ASCII" anywhere on Google except for posts quoting that Wikipedia page. It cites two references, neither of which contains the phrase "five-seven ascii".

      Though one of the references (RFC 114, for FTP) corroborates that PDP-10 could use 5x7:

          [...] For example, if a
          PDP-10 receives data types A, A1, AE, or A7, it can store the
          ASCII characters five to a word (DEC-packed ASCII).  If the
          datatype is A8 or A9, it would store the characters four to a
          word.  Sixbit characters would be stored six to a word.
      

      To me, it seems like 5x7 was one of several conventions for storing character data on a PDP-10 (and probably other 36-bit machines), and Wikipedia hallucinated that the name for this convention is "five-seven ASCII". (For niche topics like this, I sometimes see authors stating their own personal terminology as fact; be sure to check sources!)
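
The "five to a word" packing in the RFC 114 excerpt can be sketched in a few lines. This is only an illustration: the function names are made up, and the layout (characters left-justified, least significant bit left spare) is my assumption about the DEC convention. The excerpt itself only says five characters fit in a word.

```python
WORD_BITS = 36
CHARS_PER_WORD = 5
CHAR_BITS = 7

def pack_5x7(text: str) -> int:
    """Pack up to five 7-bit ASCII characters into one 36-bit integer.

    Assumed layout: first character in the top 7 bits, one spare bit
    at the bottom (5 * 7 = 35 bits used out of 36).
    """
    codes = text.encode("ascii")        # raises on anything outside 7-bit ASCII
    if len(codes) > CHARS_PER_WORD:
        raise ValueError("at most five characters fit in one word")
    codes = codes.ljust(CHARS_PER_WORD, b"\0")   # pad short strings with NUL
    word = 0
    for code in codes:
        word = (word << CHAR_BITS) | code
    return word << 1                    # leave the lowest bit unused

def unpack_5x7(word: int) -> str:
    """Inverse of pack_5x7; trailing NUL padding is dropped."""
    codes = bytes(
        (word >> (1 + CHAR_BITS * (CHARS_PER_WORD - 1 - i))) & 0x7F
        for i in range(CHARS_PER_WORD)
    )
    return codes.rstrip(b"\0").decode("ascii")

word = pack_5x7("Hello")
print(f"{word:012o}")                   # 36 bits print as 12 octal digits
print(unpack_5x7(word))
```

Note that lower case survives here, which is exactly what a 6-bit code cannot offer.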


    • That was true at the system level on ITS: file and command names were all 6-bit. But six bits don't leave space for important code points (like "lower case") needed for text processing. More practical stuff on PDP-6/10 and pre-360 IBM played other tricks.
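
A rough sketch of why six bits run out of room, assuming the common DEC-style SIXBIT mapping (ASCII codes 32 through 95 shifted down to 0 through 63). The mapping and the function name are my assumptions for illustration; the thread does not spell out the exact code table.

```python
CHARS_PER_WORD = 6
CHAR_BITS = 6

def pack_sixbit(name: str) -> int:
    """Pack up to six characters into a 36-bit integer, six bits each.

    Assumed mapping: code = ord(ch) - 32, valid only for ASCII 32..95
    (space, digits, punctuation, upper-case letters).
    """
    if len(name) > CHARS_PER_WORD:
        raise ValueError("at most six characters fit in one word")
    word = 0
    for ch in name.ljust(CHARS_PER_WORD):     # pad with spaces (code 0)
        code = ord(ch) - 32
        if not 0 <= code < 64:
            raise ValueError(f"{ch!r} has no 6-bit code")
        word = (word << CHAR_BITS) | code
    return word

print(f"{pack_sixbit('LOGIN'):012o}")   # an upper-case name fits fine

try:
    pack_sixbit("login")                # lower case falls outside the 64 codes
except ValueError as err:
    print("rejected:", err)
```

With only 64 code points, upper case, digits, and punctuation already use up the whole table, so text that needs lower case has to go to 7-bit characters or some other trick.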