Comment by efreak

5 days ago

> Why the apparently arbitrary numbers, I'm not sure, but Claude and ChatGPT both claim the codes were simply drawn from a more general-purpose sequence of product numbers used at IBM at the time.

Claude and ChatGPT are (probably) wrong. Wikipedia has three citations for the following statement:

> Originally, the code page numbers referred to the page numbers in the IBM standard character set manual

The reason they're so high is that code page numbers were assigned to EBCDIC first.

Sharlin 5 days ago

Yeah, I later found that quote on Wikipedia too. Though I don't think the cited source is super reliable either; it may be just folklore ("Oh, 'code page' refers to actual dead-tree pages"). All the IBM documentation I could find showed big gaps in the sequence of code pages.

But I just now found the list at [1]; I don't know why I didn't notice it before. It's certainly comprehensive! Some real detective work must have gone into compiling that list. The gaps are much smaller, though they still exist, e.g. from 40 to 251. The 300s are rather sparse, there are only a few 4xx codes, and then there's a jump from 500 to 8xx (with some 7xx assigned later, I think).

In any case, I agree that the LLMs seem to have hallucinated the "more general sequence" part. The code page IDs, or more formally CCSIDs, were always a specific set of 16-bit ID numbers. Why exactly the various gaps exist is probably lost to history by now, if there ever were any particular reasons.

[1] https://en.wikipedia.org/wiki/Code_page