http://www.norvig.com/spell-correct.html
A simple Python-based spell checker, along with some discussion of probability and the pitfalls of doing things the naive way.
Norvig's code solves an even more difficult problem: spelling correction, not just checking. But yeah, good read nonetheless.
I remember Bloom filters (http://en.wikipedia.org/wiki/Bloom_filter) being good for this sort of thing.
This article [] was linked in the Reddit comments for this post - it explains Bloom filters and then includes the code for a spell checker implemented with them.
[]: http://ipowerinfinity.wordpress.com/2008/03/02/bloom-filters...
Thanks for this link. Much more interesting than the article.
Definitely. I enjoy reading about these data structures/algorithms.
Seems like a lot of these cool things came out in the 80s because of the limits on memory and computational speed.
The idea would be to read in the entire dictionary and mark the Bloom filter's bits for each word. Then whenever you get a word you want to test, check whether all of the bits it would mark were previously marked. If any of them are unset, you know it's not a word. Unfortunately, if they're all set, you're only sure it's probably a word (with the false-positive probability controlled by the size of the filter and the number of hash functions you use).
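For concreteness, here's a minimal sketch of that scheme in Python. The bit-array size, the number of hash functions, and the trick of carving k indexes out of one SHA-256 digest are all illustrative choices on my part, not anything from the linked article:

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k hash positions over an m-bit array."""

        def __init__(self, m=2_000_000, k=7):
            self.m = m
            self.k = k
            self.bits = bytearray(m // 8 + 1)

        def _positions(self, word):
            # Derive k positions from slices of a single digest (not tuned).
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

        def add(self, word):
            for pos in self._positions(word):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, word):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(word))

    # Mark every dictionary word once, then query document words.
    bloom = BloomFilter()
    with open("/usr/share/dict/words") as f:     # any wordlist will do
        for line in f:
            bloom.add(line.strip().lower())

    print(bloom.might_contain("hello"))  # True
    print(bloom.might_contain("helo"))   # almost certainly False; True here would be a false positive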
Looking up a word in a dictionary is only part of what a spellchecker does. The other part is offering suggestions. This one is still not easy, especially if you want it to give reasonable suggestions for all languages. For example, the algorithm vim 7.0 uses (based on tries) was first described in a research paper but was not implemented until Bram Moolenaar did it, because the implementation was hard.
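The naive version of suggestions (roughly what the Norvig link upthread does, and nothing like vim's trie-based algorithm) is just to generate every string within one edit of the typo and keep the ones that are real words. A rough sketch:

    import string

    def edits1(word):
        """All strings one delete, transpose, replace, or insert away from word."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def suggest(word, dictionary):
        # dictionary is a set of known words, e.g. loaded from a wordlist.
        return sorted(edits1(word) & dictionary)

    # suggest("helo", dictionary) -> something like ['held', 'hello', 'help', ...]
    # depending on the wordlist; no ranking, English-only alphabet, single-edit typos only.

Those last caveats are exactly why the real problem is hard: ranking the candidates and handling other alphabets is where the work is.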
Random side note... justin.tv's spellcheck problem is a fun one... http://www.justin.tv/problems/spellcheck
I don't get what the big deal is. As long as the document you're checking fits in RAM, then you can just sort the words alphabetically and then check them in a single pass through the dictionary. If you really need random access, use a B+ tree.
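Something like this, as a sketch (assuming the on-disk wordlist is one word per line and sorted the same way Python compares lowercase strings):

    def misspelled(document_words, dictionary_path):
        """Sort the document's unique words, then verify them in one pass over a sorted wordlist."""
        words = sorted(set(w.lower() for w in document_words))
        missing = []
        with open(dictionary_path) as dictionary:
            entry = dictionary.readline().strip().lower()
            for w in words:
                # Advance through the dictionary until we reach or pass w.
                while entry and entry < w:
                    entry = dictionary.readline().strip().lower()
                if entry != w:
                    missing.append(w)
        return missing

    # misspelled("the quick brwon fox".split(), "/usr/share/dict/words")
    # -> ['brwon'] with a typical wordlist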
Where do you plan to store the dictionary for reference?
Gzip will achieve around 6x compression on a sorted wordlist. So even a 2008-sized wordlist of 240K words (call it ~2.4MB of raw text) would reduce to about 400KB. Cut the list in half or more, use a compression scheme better suited to a sorted wordlist, and you could have a useful wordlist that would fit temporarily in RAM even on 256KB machines.
I think that drfanke is suggesting that you load half the compressed dictionary (or a third, or whatever fraction fits), and then, because the document has been sorted into alphabetical order, you simply start comparing words. When you get to a word that sorts after the last entry in the current dictionary segment, you load the next segment. It's not a bad approach, and it even has the interesting property of enabling a lot of speed enhancements to the algorithm, because each word in your document is going to land near the previous one's position in the dictionary...
Does anyone know if this algo was ever actually used?
Compressed on disk/floppy. Decompress one block at a time as you scan through it.
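On a modern stack that part is nearly free, since gzip decompresses as a stream; a drop-in replacement for open() in the sketch above (the filename is made up):

    import gzip

    # gzip.open returns a file-like object that inflates the data incrementally
    # as readline() is called, so the full decompressed wordlist never has to
    # sit in memory at once.
    def open_dictionary(path="words.sorted.gz"):
        return gzip.open(path, "rt", encoding="utf-8")

On a 1980s machine you'd be doing the block-at-a-time decompression by hand, but the shape of the loop is the same.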
Isn't alphabetically sorting a document containing (potentially) tens of thousands of words going to be pretty slow on a machine of this vintage? Isn't it a bit perverse to have a spellcheck which scales worse than O(N)?
Typical clock rates at that time would have been around 5MHz. So let's say you have a 64KB document containing 16K words. For a merge sort that's roughly n log n = 2^14 x 14, on the order of 2^18 comparisons, each requiring a small number of CPU instructions, so it would take on the order of a quarter of a second. That's better than MS Word running on a modern system :-). Besides, no matter what you do, the bottleneck is going to be reading the dictionary off the floppy.
It still is, I think. Nothing seems to compare to Google's spell check.
Yes, and in some situations you can easily tap into Google's spell check for your own automated needs.
It was just as trivial then, in decent languages, as it is now. Conversely, if you make your data set large enough (say, the whole web as a corpus, not /usr/share/dict/words), the same problem is just as hard today.
Executive summary: "programming was harder when there was less abstraction".
I was hoping for a discussion of some algorithms, but instead I just got a long winded "wow, progress is amazing" rant.
This post has inspired me to start a new blog where I state the obvious and then submit it to every social news site. My first article will be "the sky is blue" except I will explain that in 1000 words or more. With ads.
now that you've given away the summary, i'm not going to read your post. and i was planning to click on lots of ads.