Comment by cb321
1 year ago
Actually, @probably_wrong was right above. It took a while, but I found that Bentley's 1986 Programming Pearls, Column 13, points to Kernighan & Plauger 1981 as describing the original v6 spell in more detail, and it was indeed the smart merge technique:
concat ..|translit A-Z a-z|translit ^a-z @n|sort|unique|common -2 wds
For the curious, in modern parlance that Unix v6 spell is:
tr A-Z a-z | tr -c a-z \\n | sort -u | comm -2 -3 - someDict
So, A) mea culpa, this corrects the record; and B) my measurements above suggest that @Paul_Houle's delta encoding idea alone would have given a ~4X IO scan reduction on an unabridged dictionary with no word-stemming accuracy trade-offs. { Delta coding is, of course, combinable with word stemming in the same way as McIlroy's ideas, since "notIn" is just a set operator. You would just have needed to `sort -u` after stemming... maybe a `stem.c` to replace the two `tr`s. } So, I still don't think you needed much random access memory.
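For concreteness, the front-coding preprocessing step could look roughly like the sketch below. This is only my own simplification of a "delta-against-the-prefix" format (a decimal count of bytes shared with the previous word, then just the new suffix), not necessarily the exact layout measured above:

```c
/* Sketch of a front-coder: reads a sorted, unique, lowercase word list on
 * stdin (as produced by the tr|sort -u pipeline above) and writes each word
 * as <bytes shared with previous word><new suffix>.  Since the words contain
 * no digits, the leading decimal count is unambiguous to decode. */
#include <stdio.h>
#include <string.h>

#define MAXW 256

int main(void) {
    char prev[MAXW] = "", word[MAXW];
    size_t n;

    while (fgets(word, sizeof word, stdin)) {
        word[strcspn(word, "\n")] = '\0';
        for (n = 0; prev[n] && prev[n] == word[n]; n++)  /* shared prefix */
            ;
        printf("%zu%s\n", n, word + n);       /* count, then suffix only */
        strcpy(prev, word);
    }
    return 0;
}
```

On a run like abbreviate, abbreviated, abbreviates this emits `0abbreviate`, `10d`, `10s`, which is the kind of shrinkage behind the ~4X IO figure above.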
I was curious whether "approximate was even faster than exact" back in 1975..78. So, I learned enough to get a v7 booting on a SIMH PDP-11 simulator. I actually got an ordered merge against the delta-against-the-prefix encoded format @Paul_Houle mentioned to run faster than the v7 spell. Here is a little transcript:
In that transcript:

- `zk` is Bob Supnik's classic PDP-11 performance test.
- `comm -23` is the slow part of the v6 Unix spell.
- `ni0.c` is the same as the above `notIn.c`.
- `ni1.c` specializes to stdin being non-prefix-delta (the dictionary is still prefix-delta) and does the simple optimization of not re-comparing already-compared prefix bytes.
- `ni2.c` adds the remaining simple (also not English-specific) optimization of skipping over long words that are part of an "increasing run" like "abbreviate abbreviated abbreviates abbreviating .." on the way to, say, "abide".
- `ni3.c` ditches stdio buffering for the dictionary in favor of its own, to play with the buffer size, since v7 could only do BUFSIZ=512 bytes.
- `ni4.c` does the same for the stdin stream and also eliminates the need for a follow-on strlen (like the modern getline/getdelim).
- Finally, `ni5.c` manually inlines the character-reading loop into the macro that gets the next dictionary word.

In real life there would probably be 2-4 seconds of IO time added for all but the `comm` variant (which needs 2.7 MB of IO, not 870 kB).
For posterity, that `ni5.c` outperforming the v7 spell (not counting IO), written in 1979 v7 C, is this:
It's actually not such a bad C program, especially if you compare it to Doug McIlroy's A) much larger, B) approximate, and C) very English-specific program. While I'm sure there is a little pre-processing time in v7 spell that could be shaved off, I feel I've confirmed the idea (mentioned in McIlroy 1982, actually, at the end of HASHING and before SUPERIMPOSED CODES) that an exact approach could have been fine.
McIlroy 1982 seemed aware of this, mentioning the Morris-Thompson way as [11], but I'm not sure he was aware of the pretty severe trade-off between more aggressive data compression and how fast it is to decode. As you can perhaps see from this program, the prefix-delta fits very naturally into the actual data flow of the algorithm and even informs how to intelligently skip a lot of data comparison. It might be possible to do that even better; I didn't think about it for too long.
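To make that concrete, here is a minimal modern-C sketch (decidedly not the original ni*.c; the `notin` name, the helpers, and the front-coded format from the encoder sketch above are all my assumptions) of an exact "notIn" merge in which the per-entry shared-byte count both bounds how far each comparison has to look and lets whole "increasing runs" be skipped without touching their suffix bytes:

```c
/* Sketch: print words on stdin (sorted, unique, one per line) that are not
 * in a front-coded dictionary file (format from the encoder sketch above). */
#include <stdio.h>
#include <string.h>

#define MAXW 256

/* Decode the next dictionary word in place: buf still holds the previous
 * word; keep *shared bytes of it and append the new suffix.
 * Returns 1 on success, 0 at end of file. */
static int next_dict(FILE *df, char *buf, int *shared) {
    int c, n;
    if (fscanf(df, "%d", shared) != 1)
        return 0;
    n = *shared;
    while ((c = getc(df)) != EOF && c != '\n')
        buf[n++] = (char)c;
    buf[n] = '\0';
    return 1;
}

/* Compare dictionary word d to input word w, which are already known to
 * agree on their first `start` bytes.  Returns <0, 0, >0 like strcmp and
 * stores the length of the agreeing prefix in *agree. */
static int cmp_from(const char *d, const char *w, int start, int *agree) {
    int i = start;
    while (d[i] && d[i] == w[i])
        i++;
    *agree = i;
    return (unsigned char)d[i] - (unsigned char)w[i];
}

int main(int argc, char **argv) {
    FILE *df;
    char word[MAXW], dict[MAXW] = "";
    int shared, have, agree, c;

    if (argc != 2 || !(df = fopen(argv[1], "r"))) {
        fprintf(stderr, "usage: notin front-coded-dict < sorted-unique-words\n");
        return 1;
    }
    have = next_dict(df, dict, &shared);
    while (fgets(word, sizeof word, stdin)) {
        word[strcspn(word, "\n")] = '\0';
        agree = 0;
        c = have ? cmp_from(dict, word, 0, &agree) : 1;
        while (have && c < 0) {             /* advance dict until >= word */
            if (!(have = next_dict(df, dict, &shared)))
                break;
            /* "Increasing run" skip: if this entry shares more bytes with
             * its predecessor than the predecessor shared with `word`, it
             * still differs from `word` at byte `agree`, in the same
             * (smaller) direction, so no comparison at all is needed. */
            if (shared > agree)
                continue;
            /* Otherwise its first `shared` bytes are known to match `word`,
             * so resume comparing there instead of at byte 0. */
            c = cmp_from(dict, word, shared, &agree);
        }
        if (!have || c > 0)                 /* never matched: not in dict */
            puts(word);
    }
    fclose(df);
    return 0;
}
```

A real contender would also replace the stdio/fscanf calls with its own buffered reads and inline the inner character loop, which is what the ni3..ni5 steps above describe; the skipping, though, is the part the prefix-delta encoding buys essentially for free.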
Of course, at some dictionary scale and some input scale, loop-over-words will beat whole-document-at-once, but 5500 words (chosen to match a statement in McIlroy 1982) is not that big. This post alone is about 1250 words. And 270e3 words is actually a large unabridged dictionary, bigger in fact than the "100-200K words" McIlroy speculated about at the time.
Since run time is clearly pretty directly proportional to dictionary size, it's fair to say a 150 kword dictionary would have been about twice as fast as v7 spell while staying exact and human-language adaptable (at the 5500-word scale, or moving the break-even down to a roughly 2750-word document). Had this idea actually been used at the time, I imagine there would have been pressure to shrink that 870 kB encoded size to fit on a 360K floppy, which would have been 43% of the size (and of the time, once rehoused on a Winchester drive), though a 2..3 floppy install would also not have been too crazy. Even in the modern age, over at https://github.com/wolfgarbe/SymSpell it seems 82,765 words, or 31% of the size (and run time), are deemed acceptable. So either triple the v7 spell speed at 5500 words or an 1800-word break-even seems feasible. (EDIT: And, of course, the preprocessor could always select whichever of the two is faster, if you want, based on how many unique words there are.)