For checking? Just a lookup on disk (no db, just a large list with a custom index, then binary search in the retrieved block). Decoding anything was slow, and in-core was basically out of the question [1]. Caching was important, though, since just a handful of words make up 50% of the text.
I once built a spell checker plus corrector which had to run in 32kB under a DOS hotkey, interacting with some word processor. On top of that, it had to run from CD-ROM and respond within a second. I could do 4 lookups, in blocks of 8kB, which gave me the option to look up the word in normal order, in reverse order, and as a phonetic transcription in both directions. Each 8kB block contained quite a few words, can't remember how many. Then I counted the similarities and returned the candidates as a sorted list. It wasn't perfect, but worked reasonably well.
[1] Adding that for professional spell checking you'd need at least 100k lemmata plus all inflections, plus per-word information if you have to accept compounds/agglutination.
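A rough sketch of that lookup style in Python (block size and names are my own guesses, not the original code): keep only the first word of each block in memory as the index, fetch one block, and binary-search inside it.

    import bisect

    BLOCK_SIZE = 512  # words per on-disk block (illustrative, not the original value)

    def build_index(sorted_words):
        # Split the sorted word list into fixed-size blocks and keep only
        # the first word of each block in memory as the coarse index.
        blocks = [sorted_words[i:i + BLOCK_SIZE]
                  for i in range(0, len(sorted_words), BLOCK_SIZE)]
        index = [b[0] for b in blocks]
        return index, blocks  # in reality the blocks would stay on disk/CD-ROM

    def contains(word, index, blocks):
        # Pick the block whose first word is the largest one <= word,
        # "read" that block, then binary-search inside it.
        i = bisect.bisect_right(index, word) - 1
        if i < 0:
            return False
        block = blocks[i]
        j = bisect.bisect_left(block, word)
        return j < len(block) and block[j] == word

    words = sorted(["apple", "banana", "cherry", "date", "elderberry"])
    index, blocks = build_index(words)
    print(contains("cherry", index, blocks), contains("grape", index, blocks))  # True False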
For the basic word list, possibly tries (https://en.wikipedia.org/wiki/Trie), DAGs (https://en.wikipedia.org/wiki/Directed_acyclic_graph#Data_co...), or a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter).
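For the trie option, a minimal membership-test sketch (nowhere near as compact as a real dictionary structure would have to be, just the idea):

    class Trie:
        def __init__(self):
            self.children = {}   # char -> Trie
            self.is_word = False

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def __contains__(self, word):
            node = self
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_word

    t = Trie()
    for w in ("cat", "car", "care"):
        t.insert(w)
    print("car" in t, "ca" in t)  # True False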
The article is about fitting large dictionaries into small memory footprints: writing a 200K-word spell checker on a machine with only 256K of memory.
When you need to store your dictionary in under 1 byte per word, a trie won't cut it.
The limit given in the article is 360KB (on floppy). At that size you can't use tries; you need lossy compression. A Bloom filter can get you a false-positive rate of about 1 in 359 with the given word list size: https://hur.st/bloomfilter/?n=234936&p=&m=360KB&k=
The error rate goes up to about 1 in 66 for 256KB (in memory only).
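A back-of-the-envelope check of those figures using the standard Bloom filter formula p = (1 - e^(-kn/m))^k, with n = 234,936 words as in the link (the exact result shifts a little depending on whether a KB is counted as 1000 or 1024 bytes and how k is rounded):

    from math import exp, log

    def false_positive_rate(n_items, m_bits, k_hashes):
        # classic Bloom filter estimate: p = (1 - e^(-k*n/m))^k
        return (1 - exp(-k_hashes * n_items / m_bits)) ** k_hashes

    n = 234_936
    for kilobytes in (360, 256):
        m = kilobytes * 1000 * 8    # bits available
        k = int(m / n * log(2))     # near-optimal number of hash functions, rounded down
        p = false_positive_rate(n, m, k)
        print(f"{kilobytes} KB: k = {k}, false positives ~ 1 in {1 / p:.0f}")
    # prints roughly 1 in 359 and 1 in 66, in line with the figures above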
According to https://en.wikipedia.org/wiki/Ispell, ispell (1971) already used Levenshtein distance (although the article does not say whether this was in the original version or was added in later years).
Levenshtein distance up to 1, according to that article. If you have a hierarchical structure of valid words (a trie or a DAG; in some sense a DAG is a trie stored more compactly, with the disadvantage that adding or removing words is hard), it is not hard to check which words satisfy that. And if you only do the inexact search after looking up the exact word and finding it missing, I think it also won't be too slow when given ‘normal’ text to spell-check.
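A quick sketch of that trie walk with a one-edit budget (dict-of-dicts trie, my own layout and names; ispell's actual implementation differs in the details):

    END = object()  # marker for end-of-word

    def add(trie, word):
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def within_one_edit(trie, word, prefix="", edits=1, out=None):
        # Collect dictionary words within Levenshtein distance `edits` of `word`.
        if out is None:
            out = set()
        if not word:
            if END in trie:
                out.add(prefix)
            if edits:
                # the dictionary word may have one extra trailing letter
                for c, child in trie.items():
                    if c is not END and END in child:
                        out.add(prefix + c)
            return out
        ch, rest = word[0], word[1:]
        for c, child in trie.items():
            if c is END:
                continue
            if c == ch:
                within_one_edit(child, rest, prefix + c, edits, out)
            elif edits:
                within_one_edit(child, rest, prefix + c, edits - 1, out)  # substitution
                within_one_edit(child, word, prefix + c, edits - 1, out)  # extra letter in dictionary word
        if edits:
            within_one_edit(trie, rest, prefix, edits - 1, out)           # letter missing from dictionary word
        return out

    trie = {}
    for w in ("cat", "cart", "car", "care", "bat"):
        add(trie, w)
    print(sorted(within_one_edit(trie, "cat")))  # ['bat', 'car', 'cart', 'cat']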
https://news.ycombinator.com/item?id=42752604
The first article I read about the techniques used in the spell program was in the May 1985 issue of Communications of the ACM (CACM for those who know), https://dl.acm.org/toc/cacm/1985/28/5, in Jon Bentley's Programming Pearls column.
It doesn't go into as much detail as the blog.codingconfessions.com article mentioned above; maybe some of the other techniques were added to spell later on?
Link to the online version of the 1985 May Programming Pearls column: https://dl.acm.org/doi/10.1145/3532.315102
The PDF version of that article: https://dl.acm.org/doi/pdf/10.1145/3532.315102