Comment by cb321
1 year ago
I was curious whether "approximate was even faster than exact" at the time in 1975..78. So, I learned a bit about getting a v7 booting on a SIMH pdp11 simulator. I actually got an ordered merge against the delta-against-the-prefix encoded format @Paul_Houle mentioned to run faster than the v7 spell. Here is a little transcript:
```
$ time zk
Total: 500400000
real 2.0 user 2.2 sys 0.1
$ echo suw = 1st 5500 sorted-unique words of Tale Of Two Cities
$ wc < suw
1531 1531 11043
$ time comm -23 - sowpodsAIO < suw >/n
real 3:04.0 user 2:28.2 sys 35.4
$ time ni0 sowpodsAIO.pd < suw.pd > /n
real 2:01.0 user 1:48.5 sys 11.4
$ time ni1 sowpodsAIO.pd < suw > /n
real 1:41.0 user 1:29.1 sys 11.3
$ time ni2 sowpodsAIO.pd < suw > /n
real 1:26.0 user 1:14.1 sys 11.5
$ time ni3 sowpodsAIO.pd < suw > /n
real 1:04.0 user 53.2 sys 10.3
$ time ni4 sowpodsAIO.pd < suw > /n
real 1:04.0 user 53.3 sys 10.1
$ time ni5 sowpodsAIO.pd < suw >/n
real 45.0 user 34.4 sys 10.1
$ time spell < suw > /n
/bin/spell: /usr/dict/spellhist: cannot create
real 46.0 user 1.4 sys 1.4
$ wc answer.spell
78 78 604 answer.spell
$ wc answer.ni2
35 35 225 answer.ni2
```
`zk` is Bob Supnik's classic PDP-11 performance test. `comm -23` is the slow part of the v6 Unix spell. The `ni*` variants build on each other:

- `ni0.c` is the same as the above `notIn.c`.
- `ni1.c` specializes to have stdin be non-prefix-delta while the dictionary stays prefix-delta, and does the simple optimization of not recomparing already-compared prefix bytes.
- `ni2.c` does the remaining simple (also not English-specific) optimization of skipping over long words that are part of an "increasing run" like "abbreviate abbreviated abbreviates abbreviating .." on the way to, say, "abide".
- `ni3.c` ditches stdio buffering for the dictionary in favor of its own, to play with the buffer size, since v7 could only do BUFSIZ=512 bytes.
- `ni4.c` does the same for the stdin stream and also eliminates the need for a follow-on strlen (like the modern getline/getdelim).
- `ni5.c` manually inlines the character-reading loop into the macro that gets the next dictionary word.

In real life, there would probably be 2-4 seconds of IO time added for all but the `comm` variant (which needs 2.7 MB of IO, not 870 kB).
For posterity, the ni5.c that outperforms v7 spell (not counting IO), in 1979 v7 C, is this:
```c
#include <stdio.h> /* Emit stdin words not in /t/dict.pd */
extern int errno;  /* ni dict.pd < sort-uWds */
#ifndef Z
#define Z 2048
#endif
char buf[Z]; char *b = buf; int n = 0;
fillbuf(fd) int fd; {
    int m = read(fd, buf, Z);
    if (m > 0) { b = buf + 1; n = m - 1; return buf[0]; }
    return -1;
}
#define chGet(fd) (n-- > 0 ? *b++ : fillbuf(fd))
char bufI[Z]; char *bI = bufI; int In = 0;
filbufI() { /* symbols must be <= 7 chars long */
    int m = read(0, bufI, Z);
    if (m > 0) { bI = bufI + 1; In = m - 1; return bufI[0]; }
    return -1;
}
#define chGetI (In-- > 0 ? *bI++ : filbufI())
readw(wI, max) char *wI; int max; {
    register int i;
    register char c;
    for (i = 0; i < max; i++) {
        c = chGetI;
        if (c != '\n' && c != -1)
            wI[i] = c;
        else
            break;
    }
    return i;
}
main(ac, av) int ac; char **av; {
    int fD = open(av[1], 0);
    char wI[64], wD[16], nPS;   /* Word buffers, lens, then.. */
    int nI=0, nD=0, oI=1, oD=1; /* ..a flag sez read was Ok. */
    int m, nS, nP, i;   /* Running num. matching prefix chars */
    if (fD < 0){fprintf(stderr, "%s: %d\n", av[1],errno);return 1;}
#define DGET /* Update [wno]D with its next word */ \
    if ((oD = oD && (nPS = chGet(fD)) != -1)) { \
        /* pcc needs & 255 here| , but runs ~8% slower */ \
        nP = ((unsigned)nPS)>> 4; nS = ((unsigned)nPS) & 15; \
        nD = nP + nS; \
        for (i = nP; i < nD; i++) wD[i] = chGet(fD); }
#define IGET /* Update [wno]I with its next word */ \
    m = 0; /* New inp word => reset cmp to start */ \
    oI = oI && (nI = readw(wI, sizeof(wI)));
#define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); IGET }
    IGET DGET /* Set Theory "not in": ordered merge impl */
    while (oI && oD) {
        register int c = 0, lim = nI < nD ? nI : nD;
        if (nP < m) m = nP;
        for (/**/; c == 0 && m < lim; m++) c = wI[m] - wD[m];
        m--;
        if (c == 0)
            c = nI - nD;
        if (c < 0) PUT                 /* I<D: Emit&AdvanceI */
        else if (c == 0) { IGET DGET } /* I==D: advance both */
        else {                         /* I>D: advance D to..*/
            if (nD < nI) {             /*..a maybe <= word.  */
                DGET                   /* Loop & compare more */
            } else {  /* Cannot possibly be <= while nP > m */
                do { DGET } while (oD && nP > m);
            }         /* The above trickery elides like 95% memcmp */
        }
    }
    while (oI) PUT /* flush tail out */
    return 0;
}
```
It's actually not so bad a C program, especially if you compare it to Doug McIlroy's A) much larger, B) approximate, and C) very English-specific program. While I'm sure there is a little pre-processing time in v7 spell that could be elided, I feel I've confirmed the idea (mentioned in McIlroy 1982, actually, at the end of HASHING and before SUPERIMPOSED CODES) that an exact approach could have been fine.
McIlroy 1982 seemed aware of this, mentioning the Morris-Thompson way as [11], but I'm not sure he was aware of the fairly severe trade-off between more aggressively compacting data-compression ideas and how fast they are to decode. As you can perhaps see from this program, the prefix-delta fits very naturally into the actual data flow of the algorithm and even informs how to intelligently skip a lot of data comparison. It might be possible to do that even better - I didn't think too long about it.
Of course, at some dictionary scale and some input scale the loop-over-words approach will beat whole-document-at-once, but 5500 words (chosen to match a statement in McIlroy82) is not that big. This post alone is about 1250. And 270e3 words is actually a large unabridged dictionary, bigger in fact than the "100-200K words" McIlroy speculated upon at the time.
Since run time is clearly pretty directly proportional to dictionary size, it's fair to say a 150 kword dictionary would have been about twice as fast as v7 spell while being exact and human-language adaptable (at the 5500-word scale, or moving the break-even down to a roughly 2750-word document). Had this idea actually been used at the time, I imagine there would have been pressure to shrink that 870 kB encoded size to fit on a 360K floppy drive, which would have been 43% of the size (and of the time, once rehoused on a Winchester drive), though a 2..3-floppy install would also not have been too crazy. Even in the modern age, over at https://github.com/wolfgarbe/SymSpell it seems 82,765 words, or 31% of the size (and run time), are deemed acceptable. So, either triple the v7 spell speed@5500 or an 1800-word break-even seem feasible. (EDIT: And, of course, the preprocessor could always select whichever of the two is fastest, if you want, based on how many unique words there are.)