Comment by cb321

1 year ago

I have a feeling the algorithm used was not the smart merge mentioned in the grandparent. That "slower" spell was in v6 Unix { https://en.wikipedia.org/wiki/Spell_(Unix) } which came out in 1975 and by 1973 there were already Winchester drives doing 885 kB/s with 25ms seeks { https://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_d... }. A 250e3 word dictionary with average word length of 10 bytes would have only taken about 2500e3/885e3 = 2.8 seconds to scan and the unique words in most documents in practice would have easily fit in RAM (as mentioned in Doug's 1982 paper). Not great, but not so bad, either. People didn't spell check on every key press for another 10 years. ;-)

Someone probably could look all this up in the "unified version control of all Unix" history, but the way Doug describes it in the 1982 paper linked in the article, it sounds like the v6 spell did a "doc word at a time loop" instead of "sort all doc words at once and merge against a pre-sorted dictionary", and in fact it sounds like Johnson selected a small dictionary to facilitate that instead of "merging".

It's easy to compress a sorted dictionary by turning something like

  a
  ab
  abc

to

  a
  1b
  2c

where the prefix count is the number of characters shared with the previous word (and might be coded as a byte, as opposed to the ASCII digit you're probably picturing there). This would be a simple form of compression to code up in BASIC, unlike Huffman or most LZW variants, which involve bit twiddling and maintaining tree data structures... though I do remember writing SQ and USQ [1] in BASIC.
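
To make that concrete, here is a minimal expansion sketch (mine, not from the original comment) that turns the digit-prefixed form above back into full words; it assumes well-formed input with the shared-prefix count written as leading decimal digits:

    #include <stdio.h>  /* expand "a / 1b / 2c" style lines back to full words */
    #include <string.h>

    int main(void) {
        char line[64], word[64] = "";
        while (fgets(line, sizeof line, stdin)) {
            size_t nPfx = 0, i = 0;
            line[strcspn(line, "\n")] = '\0';           /* drop trailing newline */
            while (line[i] >= '0' && line[i] <= '9')    /* leading digits = shared prefix */
                nPfx = nPfx * 10 + (size_t)(line[i++] - '0');
            strcpy(word + nPfx, line + i);              /* splice new suffix onto kept prefix */
            puts(word);
        }
        return 0;
    }

The byte-coded (rather than digit-coded) variant is what delta.c below emits.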

[1] https://techtinkering.com/articles/compression-and-archiving...

  • In fact, I think this smart merge technique would have been faster (and exact, with no approximations!) than what McIlroy 1982 describes on 1975..1980 hardware, in spite of the dictionary not fitting in memory. The end of that article says it took about 65..77 seconds of CPU on a PDP 11/70 to check a 5500 word document.

    Meanwhile, the approach of the 3 files below (which would need only tiny adaptations to 1975 C) should have taken about 4 seconds at 200 kB/sec IO time (see below). This kind of technique was also quite common in the 1960s database community. So it's a bit odd that the Bell Labs guys wouldn't have known about it, and it shows off the C/Unix system just as well. Here is an example implementation:

    delta.c:

        #include <stdio.h>  /* stdin->out: delta encode sorted lines */
        #include <string.h>
        struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS; /* 4+4 bits => words up to 15 chars */
        
        int main(int ac, char **av) {
            char word[64], prev[64];
            int  nPrev = 0;
            while (fgets(&word[0], sizeof word, stdin)) {
                int n = strlen(word), nPfx;
                int m = n < nPrev ? n : nPrev;
                for (nPfx = 0; nPfx < m; nPfx++)
                    if (prev[nPfx] != word[nPfx]) break;
                nPS.nPfx = nPfx;
                nPS.nSfx = n - nPfx - 1;         /* -1 drops fgets's trailing newline */
                fwrite(&nPS, 1, 1, stdout);
                fwrite(&word[nPfx], 1, n - nPfx - 1, stdout);
                nPrev = n - 1;
                memcpy(prev, word, n - 1);
            }
            return 0;
        }
    

    words.c:

        #include <stdio.h> /* Emit all words in stdin, 1 to a line */
        #include <ctype.h> /*~tr -cs A-Za-z \\n|tr A-Z a-z|grep -vE .{16,} */
        
        int main(int ac, char **av) {
            char wd[15];
            int  n = 0, c;          /* n = letters accumulated in wd so far */
        #define PUT fwrite_unlocked(wd,1,n,stdout); putchar_unlocked('\n')
            while ((c = getchar_unlocked()) >= 0) { /* accum word chs */
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                    if (n < sizeof wd)          /* make smarter & .. */
                        wd[n++] = tolower(c);   /*..contraction aware. */
                    else
                        fprintf(stderr, "too long: %15.15s...\n", wd);
                } else if (n > 0) { /* non-word c & have data */
                    PUT; n = 0;     /* put & reset */
                }
            }
            if (n > 0) { PUT; }     /* put any final word */
            return 0;
        }
    

    notIn.c:

        #include <stdio.h>  /* Emit stdin words not in /tmp/dict */
        #include <string.h> /* Both sources must be delta-encoded */
        struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS;
        
        int main(int ac, char **av) {
            FILE *fI = stdin, *fD = fopen("/tmp/dict", "r");
            char  wI[16], wD[16];       /* Word buffers */
            int   nI=0, nD=0, oI, oD;   /* Lengths; oX = last read was Ok */
            if (!fD) { fprintf(stderr, "cannot open /tmp/dict\n"); return 1; }
        #define GET(T) /* update [wno][ID] with its next word */ \
            if (o##T = fread(&nPS, 1, sizeof nPS, f##T)==sizeof nPS) { \
                n##T = nPS.nPfx + nPS.nSfx; \
                o##T = o##T && fread(&w##T[nPS.nPfx], 1, nPS.nSfx, f##T);}
        #define PUT { fwrite(wI, 1, nI, stdout); putchar('\n'); GET(I) }
            GET(I); GET(D); /* set theory "not in": ordered merge impl */
            while (oI && oD) {
                int c = memcmp(wI, wD, nI < nD ? nI : nD);
                if (c == 0) c = nI - nD;
                if (c < 0) PUT                   /* I<D: emit&advance I */
                else if (c == 0){GET(I);GET(D);} /* I==D: advance both */
                else GET(D);                     /* I>D: advance D */
            }
            while (oI) PUT                       /* flush tail out */
            return 0;
        } /* (echo a;cat $SOWPODS) | delta>/tmp/dict # DE UN-abridged dict
             words < Input | sort -u | delta | notIn # Extract,DE,Check */
    

    To back up my 4.2 seconds at 200 kB/sec: take any SOWPODS word list and it compresses to about 837 KiB with that delta.c; 837/200 = 4.185 seconds.
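
    And to sanity-check correctness rather than just size, a companion decoder for delta.c's byte stream could look like the sketch below ("undelta.c" is my name for it; it is not one of the three files above). Fed /tmp/dict, it should reproduce the word list that went into delta, given delta.c's trailing-newline assumption:

        #include <stdio.h>  /* stdin->out: decode delta.c's byte stream */
        struct { unsigned nPfx:4, nSfx:4; } __attribute__((packed)) nPS;

        int main(int ac, char **av) {
            char word[64];
            while (fread(&nPS, 1, sizeof nPS, stdin) == sizeof nPS) {
                /* keep nPfx chars of the previous word, read nSfx new ones */
                if (fread(&word[nPS.nPfx], 1, nPS.nSfx, stdin) != nPS.nSfx)
                    break;
                fwrite(word, 1, nPS.nPfx + nPS.nSfx, stdout);
                putchar('\n');
            }
            return 0;
        }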

    If the rejoinder is "that SOWPODS stuff had/has licensing trouble" then no problemo - just use whatever in-house dictionary they used plus their stemming / auto-suffix machinery to synthesize an exact dictionary. Then you can correct it as you go, et voila. In fact, if you wanted to go even faster than this 15..20X and were willing to make accuracy-perf trade-offs, you could probably shrink IO further by generating the delta-encoded stream directly from the suffix rules.

    I'd recommend staying exact, though. In this case, it seems a bad algorithmic idea led them to be both inaccurate and slower, and the writing was on the wall that hardware was getting faster. And honestly, my SOWPODS dictionary that seems 15X faster may well have better coverage than what they had, which means an at-the-time apples-to-apples comparison might have been 20..25X faster.

    As a kind of data-compression / next-optimizations side note, the 267751-word SOWPODS I compressed to 857065 bytes this way can only be Zstd -19'd down to about 588040 bytes. It's all mono-case, so with a simple array-of-5-bit-codes encoding you could honestly get that 857065 bytes down to (857065-267751)*5 + 267751*10 bits = 703010 bytes (5 bits per suffix letter plus a 10-bit prefix/suffix-length header per word), less than 1.2X bigger than Zstd -19, but only 1.21X better than the more naive encoding above. So, you know, simple delta encoding works like gangbusters on a sorted spelling dictionary, has a very simple set membership algo (as above), and seems like it would have been at least an order of magnitude faster than what they actually did. I'd be a bit surprised if no one pointed this out at the time.
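
    For the curious, here is a minimal sketch of that array-of-5-bit-codes idea; it is my illustration of where the (857065-267751)*5 + 267751*10 bit figure comes from, not code from the thread. Each record becomes a 5-bit prefix count, a 5-bit suffix count, and 5 bits per suffix letter (a=0..z=25):

        #include <stdio.h>  /* toy 5-bit packer, illustration only */
        #include <string.h>

        static unsigned acc;    /* bit accumulator */
        static int nAcc;        /* number of valid low bits in acc */

        static void putBits(unsigned v, int n, FILE *f) {
            acc = (acc << n) | (v & ((1u << n) - 1));
            for (nAcc += n; nAcc >= 8; )                 /* flush whole bytes */
                putc((int)(acc >> (nAcc -= 8)) & 255, f);
        }

        static void putRec(int nPfx, const char *sfx, FILE *f) {
            int i, nSfx = (int)strlen(sfx);
            putBits((unsigned)nPfx, 5, f);               /* 10-bit header: */
            putBits((unsigned)nSfx, 5, f);               /*   prefix, suffix lens */
            for (i = 0; i < nSfx; i++)
                putBits((unsigned)(sfx[i] - 'a'), 5, f); /* 5 bits per letter */
        }

        int main(void) {        /* "a", "ab", "abc" as in the example up top */
            putRec(0, "a", stdout); putRec(1, "b", stdout); putRec(2, "c", stdout);
            if (nAcc)           /* zero-pad and flush any final partial byte */
                putc((int)(acc << (8 - nAcc)) & 255, stdout);
            return 0;
        }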

    • Actually, @probably_wrong was right above. It took a while, but I found that Bentley's 1986 Programming Pearls, Column 13, cites Kernighan & Plauger 1981 as describing the original v6 spell in more detail, and it was indeed the smart merge technique:

          concat ..|translit A-Z a-z|translit ^a-z @n|sort|unique|common -2 wds
      

      For the curious, in modern parlance that Unix v6 spell is:

          tr A-Z a-z | tr -c a-z \\n | sort -u | comm -2 -3 - someDict
      

      So, A) mea culpa & this corrects the record, and B) my measurements above suggest that @Paul_Houle's delta encoding idea alone would have given a ~4X IO scan reduction on an unabridged dictionary with no word-stemming accuracy trade-offs. { Delta coding is, of course, combinable with word stemming in the same way as McIlroy's ideas, since "notIn" is just a set operator. You would just have needed to sort -u after stemming... maybe a stem.c to replace the two trs; see the toy sketch below. } So, I still don't think you needed large Random Access Memory.
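
      A toy sketch of what such a stem.c could look like (purely illustrative; real affix handling would mirror whatever suffix rules the dictionary synthesis used, not these two hard-coded ones):

          #include <stdio.h>  /* Emit crudely stemmed words, 1 per line */
          #include <string.h>

          static void strip(char *w, int *n) {            /* toy suffix stripping */
              if (*n > 4 && !memcmp(w + *n - 3, "ing", 3)) *n -= 3;
              else if (*n > 2 && w[*n - 1] == 's')         *n -= 1;
          }

          int main(void) {
              char wd[16];
              int  n = 0, c;
              while ((c = getchar()) >= 0) {
                  if (c >= 'A' && c <= 'Z') c += 'a' - 'A'; /* lower-case */
                  if (c >= 'a' && c <= 'z') {               /* accumulate letters */
                      if (n < (int)sizeof wd) wd[n++] = (char)c;
                  } else if (n > 0) {                       /* word boundary */
                      strip(wd, &n);
                      fwrite(wd, 1, n, stdout); putchar('\n');
                      n = 0;
                  }
              }
              if (n > 0) { strip(wd, &n); fwrite(wd, 1, n, stdout); putchar('\n'); }
              return 0;
          }

      The check side would then be something like: stem < Input | sort -u | delta | notIn, with /tmp/dict built from similarly stemmed dictionary entries.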
