Comment by cb321

1 year ago

I was curious whether "approximate was even faster than exact" at the time, in 1975..78. So, I learned a bit about how to get a v7 booting on a SIMH PDP-11 simulator. I actually got an ordered merge against the delta-against-the-prefix encoded format @Paul_Houle mentioned to run faster than the v7 spell. Here is a little transcript:

    $ time zk
    Total: 500400000
    real        2.0 user        2.2 sys         0.1
    $ echo suw = 1st 5500 sorted-unique words of Tale Of Two Cities
    $ wc < suw
       1531   1531   11043
    $ time comm -23 - sowpodsAIO < suw >/n
    real     3:04.0 user     2:28.2 sys        35.4
    $ time ni0 sowpodsAIO.pd < suw.pd > /n
    real     2:01.0 user     1:48.5 sys        11.4
    $ time ni1 sowpodsAIO.pd < suw > /n
    real     1:41.0 user     1:29.1 sys        11.3
    $ time ni2 sowpodsAIO.pd < suw > /n
    real     1:26.0 user     1:14.1 sys        11.5
    $ time ni3 sowpodsAIO.pd < suw > /n
    real     1:04.0 user       53.2 sys        10.3
    $ time ni4 sowpodsAIO.pd < suw > /n
    real     1:04.0 user       53.3 sys        10.1
    $ time ni5 sowpodsAIO.pd < suw >/n
    real       45.0 user       34.4 sys        10.1
    $ time spell < suw > /n
    /bin/spell: /usr/dict/spellhist: cannot create
    real       46.0 user        1.4 sys         1.4
    $ wc answer.spell
         78     78     604 answer.spell
    $ wc answer.ni2
         35     35     225 answer.ni2

zk is Bob Supnik's classic PDP-11 performance test, and `comm -23` is the slow part of the v6 Unix spell. The ni variants are:

- `ni0.c` is the same as the above `notIn.c`.
- `ni1.c` specializes to a plain (non-prefix-delta) stdin while the dictionary stays prefix-delta, and does the simple optimization of not re-comparing already-compared prefix bytes.
- `ni2.c` adds the remaining simple (also not English-specific) optimization of skipping over long words that are part of an "increasing run" like "abbreviate abbreviated abbreviates abbreviating .." on the way to, say, "abide".
- `ni3.c` ditches stdio buffering for the dictionary in favor of its own, to play with the buffer size, since v7 stdio could only do BUFSIZ=512 bytes.
- `ni4.c` does the same for the stdin stream and also eliminates the need for a follow-on strlen (like the modern getline/getdelim).
- `ni5.c` manually inlines the character-reading loop into the macro that gets the next dictionary word.

In real life, there would probably be 2-4 seconds of IO time added for all but the `comm` variant (which needs 2.7 MB of IO rather than 870 kB).
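
The transcript doesn't show the `.pd` preprocessor itself, but the decode loop in the listing further down pins the format down: each dictionary word is one header byte whose high nibble is the count of leading bytes shared with the previous word and whose low nibble is the count of new suffix bytes, followed by those suffix bytes. A minimal sketch of such an encoder (my own reconstruction in modern C; the error handling and the 15-byte cap are my assumptions from the nibble fields and the wD[16] decode buffer, not code from the original) would be:

    /* pdenc.c - hypothetical prefix-delta encoder for the ".pd" layout that
     * the ni*.c programs decode.  Reads sorted, unique, newline-terminated
     * words on stdin and writes, per word, one header byte
     * (sharedPrefixLen << 4) | suffixLen, then the suffix bytes. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char prev[64] = "", word[64];
        while (fgets(word, sizeof word, stdin)) {
            size_t len = strcspn(word, "\n");
            word[len] = '\0';
            if (len > 15) {                    /* nibble fields & wD[16] cap word length */
                fprintf(stderr, "word too long: %s\n", word);
                return 1;
            }
            size_t np = 0;                     /* bytes shared with the previous word */
            while (np < len && prev[np] == word[np])
                np++;
            putchar((int)((np << 4) | (len - np)));  /* header byte: nP nibble, nS nibble */
            fwrite(word + np, 1, len - np, stdout);  /* the new suffix bytes */
            strcpy(prev, word);
        }
        return 0;
    }

Because the header packs both lengths into one byte, the decoder can recover a word's full length without scanning for a terminator, which is exactly what DGET exploits below.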

For posterity, here is that ni5.c, outperforming v7 spell (not counting IO), in 1979 v7 C:

    #include <stdio.h>  /* Emit stdin words not in /t/dict.pd */
    extern int errno;   /* ni dict.pd < sort-uWds */
    #ifndef Z
    #define Z 2048
    #endif
    char buf[Z]; char *b = buf; int n = 0;
    fillbuf(fd) int fd; {
        int m = read(fd, buf, Z);
        if (m > 0) { b = buf + 1; n = m - 1; return buf[0]; }
        return -1;
    }
    #define chGet(fd) (n-- > 0 ? *b++ : fillbuf(fd))
    char bufI[Z]; char *bI = bufI; int In = 0;
    filbufI() { /* symbols must be <= 7 chars long */
        int m = read(0, bufI, Z);
        if (m > 0) { bI = bufI + 1; In = m - 1; return bufI[0]; }
        return -1;
    }
    #define chGetI (In-- > 0 ? *bI++ : filbufI())
    readw(wI, max) char *wI; int max; {
        register int  i;
        register char c;
        for (i = 0; i < max; i++) {
            c = chGetI;
            if (c != '\n' && c != -1)
                wI[i] = c;
            else
                break;
        }
        return i;
    }
    main(ac, av) int ac; char **av; {
        int  fD = open(av[1], 0);
        char wI[64], wD[16], nPS;   /* Word buffers, lens, then.. */
        int nI=0, nD=0, oI=1, oD=1; /* ..a flag sez read was Ok. */
        int m, nS, nP, i;   /* Running num. matching prefix chars */
        if (fD < 0){fprintf(stderr, "%s: %d\n", av[1],errno);return 1;} /* open fails w/-1 */
    #define DGET    /* Update [wno]D with its next word */ \
        if ((oD = oD && (nPS = chGet(fD)) != -1)) { \
            /* pcc needs & 255 here| , but runs ~8% slower */ \
            nP = ((unsigned)nPS)>> 4; nS = ((unsigned)nPS) & 15; \
            nD = nP + nS; \
            for (i = nP; i < nD; i++) wD[i] = chGet(fD); }
    #define IGET    /* Update [wno]I with its next word */ \
        m = 0;      /* New inp word => reset cmp to start */ \
        oI = oI && (nI = readw(wI, sizeof(wI)));
    #define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); IGET }
        IGET DGET   /* Set Theory "not in": ordered merge impl */
        while (oI && oD) {
            register int c = 0, lim = nI < nD ? nI : nD;
            if (nP < m) m = nP;
            for (/**/; c == 0 && m < lim; m++) c = wI[m] - wD[m];
            m--;
            if (c == 0)
                c = nI - nD;
            if (c < 0) PUT                  /* I<D: Emit&AdvanceI */
            else if (c == 0) { IGET DGET }  /* I==D: advance both */
            else {                          /* I>D: advance D to..*/
                if (nD < nI) {              /*..a maybe <= word. */
                    DGET    /* Loop & compare more */
                } else {    /* Cannot possibly be <= while nP > m */
                    do { DGET } while (oD && nP > m);
                } /* The above trickery elides like 95% memcmp */
            }
        }
        while (oI) PUT                      /* flush tail out */
        return 0;
    }
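
If you want to try this on a v7 image yourself, the build and run would look something like the following (this is my sketch, not part of the transcript; producing the sorted, unique, one-word-per-line input in the same letter case and sort order as the dictionary is left out):

    $ cc -O ni5.c -o ni5
    $ sort -u words > suw
    $ ni5 sowpodsAIO.pd < suw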

It's actually not so bad a C program, especially if you compare it to Doug McIlroy's program, which is A) much larger, B) approximate, and C) very English-specific. While I'm sure v7 spell spends a little time in pre-processing that could be elided, I feel I've confirmed the idea (mentioned in McIlroy 1982, actually, at the end of HASHING & before SUPERIMPOSED CODES) that an exact approach could have been fine.

McIlroy 1982 seemed aware of this, mentioning the Morris-Thompson way as [11], but I'm not sure he was aware of the pretty severe trade-off between more aggressively compacting data-compression ideas and how fast they are to decode. As you can perhaps see from this program, the prefix-delta fits very naturally into the actual data flow of the algorithm and even informs how to intelligently skip a lot of data comparison. It might be possible to do that even better - I didn't think too long about it.
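
To make that skipping concrete, here is a small stand-alone toy (mine, not any of the ni*.c; it only exercises the advance-the-dictionary-while-the-input-compares-greater path) that walks one hand-encoded prefix-delta run and counts how many bytes actually get compared against a single input word:

    /* skipdemo.c - toy illustration of the "nP > m" skip used in the merge. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* The sorted run "abbreviate abbreviated abbreviates abbreviating abide",
         * hand-encoded as (shared-prefix length, new suffix) pairs, i.e. what
         * the .pd header nibbles carry. */
        struct { int np; const char *sfx; } dict[] = {
            {0, "abbreviate"}, {10, "d"}, {10, "s"}, {9, "ing"}, {2, "ide"},
        };
        const char *wI = "abide";        /* the one input word being looked up */
        int nI = (int)strlen(wI);
        char wD[32];
        int m = 0;                       /* bytes of wI known to match wD */
        int c = 1;                       /* last comparison; >0 means wI > wD */
        for (int k = 0; k < 5; k++) {
            int np = dict[k].np, ns = (int)strlen(dict[k].sfx), nD = np + ns;
            memcpy(wD + np, dict[k].sfx, (size_t)ns + 1);   /* decode the word */
            if (k > 0 && c > 0 && np > m) {  /* wD[m] is unchanged, so wI is.. */
                printf("%-13s skipped (np=%d > m=%d), 0 bytes compared\n",
                       wD, np, m);           /* ..still greater: skip outright */
                continue;
            }
            if (np < m) m = np;          /* only the first np bytes of wD survive */
            int compared = 0, lim = nI < nD ? nI : nD;
            for (c = 0; c == 0 && m < lim; m++) { c = wI[m] - wD[m]; compared++; }
            if (c != 0) m--;             /* m = index of the mismatching byte */
            else c = nI - nD;            /* equal prefixes: longer word is greater */
            printf("%-13s %s, %d byte(s) compared\n", wD,
                   c < 0 ? "wI < wD" : c > 0 ? "wI > wD" : "match", compared);
        }
        return 0;
    }

Running it reports 3 bytes compared for "abbreviate", 0 for each of the three "abbreviat*" continuations, and 3 more to confirm "abide" itself, which is the effect the `do { DGET } while (oD && nP > m);` loop has at dictionary scale.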

Of course, at some dictionary scale and some input scale the loop-over-words approach will beat whole-document-at-once, but 5500 words (chosen to match a statement in McIlroy 1982) is not that big. This post alone is about 1250 words. And 270e3 words is actually a large unabridged dictionary, bigger in fact than the "100-200K words" McIlroy speculated upon at the time.

Since run time is clearly pretty directly proportional to dictionary size, it's fair to say a 150kw dictionary would have been about twice as fast as v7 spell while staying exact and adaptable to other human languages (at the 5500-word scale; alternatively, it moves the break-even down to a roughly 2750-word document). Had this idea actually been used at the time, I imagine there would have been pressure to shrink that 870 kB encoded size to fit on a 360K floppy, which would have been 43% of the size (and of the time, once rehoused on a Winchester drive), though a 2..3-floppy install would also not have been too crazy. Even in the modern age, over at https://github.com/wolfgarbe/SymSpell, it seems 82,765 words, or 31% of the size (and run time), are deemed acceptable. So either triple the v7 spell speed at 5500 words or an 1800-word break-even seems feasible. (EDIT: And, of course, the preprocessor could always select whichever of the two approaches is fastest, if you want, based on how many unique words there are.)