Comment by AlphaAndOmega0

4 months ago

I'd have liked more explanation of the actual solutions that programmers used at the time.

For checking? Just a lookup on disk (no DB, just a large sorted list with a custom index, then binary search in the retrieved block). Decoding anything was slow, and keeping the whole list in-core was basically out of the question [1]. Caching was important, though, since just a handful of words make up 50% of the text.
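
A minimal sketch of that scheme in modern terms, assuming a sorted word file laid out in fixed-size blocks and a small in-memory index holding the first word of each block (the file layout and names here are illustrative, not the original):

    import bisect

    BLOCK_SIZE = 8192  # one disk read per lookup

    def build_index(path):
        """One pass over the sorted word file: note the first word of each block."""
        firsts, offsets = [], []
        with open(path, "rb") as f:
            offset = 0
            while block := f.read(BLOCK_SIZE):
                firsts.append(block.split(b"\n", 1)[0])
                offsets.append(offset)
                offset += BLOCK_SIZE
        return firsts, offsets

    def contains(path, firsts, offsets, word):
        """One seek plus one block read, then binary search inside the block."""
        key = word.encode()
        i = bisect.bisect_right(firsts, key) - 1  # last block starting <= key
        if i < 0:
            return False
        with open(path, "rb") as f:
            f.seek(offsets[i])
            words = f.read(BLOCK_SIZE).split(b"\n")
        j = bisect.bisect_left(words, key)
        return j < len(words) and words[j] == key

A real layout would pad blocks so no word straddles a boundary, and a cache of recently read blocks (plus the handful of very frequent words) saves most of the disk reads.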

I once built a spell checker plus corrector which had to run in 32kB under a DOS hotkey, interacting with some word processor. On top of that, it had to run from CD-ROM and respond within a second. I could afford 4 lookups, in blocks of 8kB, which gave me the option to look up the word in normal order, in reverse order, and as a phonetic transcription in both directions. Each 8kB block contained quite a few words; I can't remember how many. I then counted the similarities and returned the candidates as a sorted list. It wasn't perfect, but it worked reasonably well.
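
A reconstruction of how those four lookups could combine, in modern terms; the phonetic key and the similarity measure below are stand-ins, not the original algorithm:

    import bisect
    import difflib

    # Hypothetical stand-in for the original phonetic transcription:
    # a crude consonant skeleton, just to make the idea concrete.
    def phonetic(word):
        return "".join(c for c in word if c not in "aeiouy")

    KEYS = [
        lambda w: w,                  # normal order
        lambda w: w[::-1],            # reversed
        lambda w: phonetic(w),        # phonetic
        lambda w: phonetic(w)[::-1],  # phonetic, reversed
    ]

    def build(words):
        """Four sorted views of the dictionary, one per key (built once)."""
        return [sorted((key(w), w) for w in words) for key in KEYS]

    def suggest(views, word, window=50, n=5):
        """Pull the neighbourhood of the query out of each view (the analogue
        of reading one 8kB block) and rank the union by similarity."""
        candidates = set()
        for key, view in zip(KEYS, views):
            i = bisect.bisect_left(view, (key(word),))
            candidates.update(w for _, w in view[max(0, i - window):i + window])
        sim = lambda c: difflib.SequenceMatcher(None, word, c).ratio()
        return sorted(candidates, key=sim, reverse=True)[:n]

The reversed keys catch errors near the end of a word, which a plain left-to-right lookup tends to miss; with a decent word list, suggest(build(words), "recieve") should put "receive" near the top.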

[1] Note that for professional spell checking you'd need at least 100k lemmata, plus all inflections, plus per-word information if you have to accept compounds/agglutination.

For the basic word list, possibly tries (https://en.wikipedia.org/wiki/Trie), DAGs (https://en.wikipedia.org/wiki/Directed_acyclic_graph#Data_co...), or Bloom filters (https://en.wikipedia.org/wiki/Bloom_filter).
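
Of those, only the Bloom filter gets close to one byte per word, at the price of occasional false positives. A minimal sketch (the hashing scheme is illustrative, not canonical):

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n_items, bits_per_item=8):
            self.m = n_items * bits_per_item                      # total bits
            self.k = min(8, max(1, round(bits_per_item * math.log(2))))
            self.bits = bytearray(self.m // 8 + 1)

        def _positions(self, word):
            # Carve k positions out of one wide hash (k <= 8 fits in sha256).
            digest = hashlib.sha256(word.encode()).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

        def add(self, word):
            for p in self._positions(word):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, word):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(word))

At 8 bits per word the optimal number of hashes is about 6, giving a false-positive rate of roughly (1 - e^(-3/4))^6 ≈ 2%: the filter never misses a correct word, but occasionally waves a misspelling through.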

  • The article is about fitting large dictionaries into small memory footprints: writing a 200K-word spell checker on a machine with only 256K of memory.

    When you need to store your dictionary in under 1 byte per word, a trie won't cut it.
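
    A back-of-the-envelope version of that budget (the split between OS, program, and dictionary is my assumption, not the article's):

        words = 200_000
        budget = 200 * 1024        # bytes left for the word list, say
        print(budget * 8 / words)  # ~8.2 bits per word
        # A trie node needs at least one child pointer: 16 bits on a
        # 16-bit machine, so even one node per word is ~2x over budget
        # before storing a single character.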

  • According to Wikipedia (https://en.wikipedia.org/wiki/Ispell), ispell (1971) already used Levenshtein distance (although the article does not state whether this existed in the original version or was added in later years).

    • Levenshtein distance up to 1, according to that article. If you have a hierarchical structure over the valid words (a trie or a DAG; in some sense, a DAG is a trie stored more efficiently, with the disadvantage that adding or removing words is hard), it is not hard to enumerate the words that satisfy that. And if you only do the inexact search after looking up the exact word and finding it missing, I think it won't be too slow when given 'normal' text to spell-check.
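
      A sketch of that check (edit distance at most 1) over a plain dict-of-dicts trie, assuming the whole structure fits in memory (ispell itself used a hashed dictionary with affix rules, so this shows only the shape of the idea):

          END = object()  # marks end-of-word in the trie

          def insert(trie, word):
              node = trie
              for ch in word:
                  node = node.setdefault(ch, {})
              node[END] = True

          def within_one(trie, word):
              """All trie words with Levenshtein distance <= 1 from `word`."""
              results = set()

              def walk(node, i, edited, prefix):
                  if i == len(word):
                      if END in node:
                          results.add(prefix)
                      if not edited:  # one trailing insertion
                          for ch, child in node.items():
                              if ch is not END and END in child:
                                  results.add(prefix + ch)
                      return
                  if word[i] in node:  # exact step, no edit spent
                      walk(node[word[i]], i + 1, edited, prefix + word[i])
                  if not edited:
                      walk(node, i + 1, True, prefix)  # delete word[i]
                      for ch, child in node.items():
                          if ch is END:
                              continue
                          walk(child, i + 1, True, prefix + ch)  # substitute
                          walk(child, i, True, prefix + ch)      # insert

              walk(trie, 0, False, "")
              return results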

https://news.ycombinator.com/item?id=42752604