While these specific challenges are trivial now because of increased memory and CPU, user expectations have increased too, requiring more advanced features which are not trivial to implement across different platforms. Users now expect good auto-suggestion when doing text entry on mobile platforms and laugh when it fails ( http://damnyouautocorrect.com/ ). I just launched an iPad app to improve communication for speech-disabled users and nearly everyone I talked to wanted word-completion and next-word suggestion, even when offline.
While it seems trivial for Google or a database-backed server to provide real-time intelligent suggestions (e.g. suggest 5 words that follow "Harry"), implementing such a feature on the iPad took me over a month, even though I knew exactly what I wanted to make and had all the necessary data beforehand. I had a list of 1 million words and phrases with frequency of usage (16 MB of text data) and wanted to suggest 3-7 words in real time as the user typed each letter. Implementing this on the iPad required quite a bit of engineering and optimization.
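Not the commenter's actual implementation, but one standard way to get real-time prefix completion from a frequency list is a sorted array plus binary search, so all completions of a prefix form a contiguous slice. A minimal sketch with a toy word list (the real data would be ~1M entries loaded from disk):

```python
import bisect

# Toy stand-in for the frequency list described above: word -> usage count.
freq = {"harry": 500, "harp": 120, "harpoon": 15, "hat": 300, "happy": 450}

# Sort words once; every completion of a prefix is then a contiguous slice.
words = sorted(freq)

def suggest(prefix, k=3):
    """Return the k most frequent words starting with prefix."""
    lo = bisect.bisect_left(words, prefix)
    hi = bisect.bisect_left(words, prefix + "\uffff")  # just past the prefix range
    candidates = words[lo:hi]
    return sorted(candidates, key=lambda w: -freq[w])[:k]
```

On a phone or tablet you would likely replace the dict with a packed, memory-mapped structure, but the slice-then-rank idea is the same.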
That sounds like a really cool technical challenge. Care to write about how you solved it?
Some but not all users like this. I can't stand it. I disable all forms of auto-suggest whenever that's allowed. Please keep it an option.
I know it's HN folklore to claim "Trivial! Would do over a weekend!", but please...
Doing a half-decent production spell checker is STILL a major feat. Same as "just crawling the web" (further down the discussion). Both require problem understanding and engineering you can't see and appreciate at a glance.
And no, looking up individual words in some predefined dictionary doesn't qualify as half-decent spell checking, especially for non-English languages. Spelling correction is another step.
That's not the point of the article. He's not talking about writing a state-of-the-art spelling-and-grammar checker.
> And no, looking up individual words in some predefined dictionary doesn't qualify as half-decent spell checking,
Well, but the author is talking about that problem! Even if you don't consider that real spell-checking, his point still stands. Let's define crappy-spell-checking as "looking up individual words in some predefined dictionary"; that problem used to be hard and now it's very easy, as in, you could write one in 15 minutes using Python.
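To make the "15 minutes in Python" claim concrete, here is crappy-spell-checking exactly as defined above: flag any word not found in a predefined word set. The tiny WORDS set is a stand-in; on most Unix systems you could load /usr/share/dict/words instead.

```python
# Stand-in dictionary; real use would load a word list from disk, e.g.:
#   WORDS = set(open("/usr/share/dict/words").read().lower().split())
WORDS = {"the", "quick", "brown", "fox", "jumps"}

def misspelled(text):
    """Return the words in text that are not in the dictionary."""
    cleaned = [w.strip(".,;:!?") for w in text.lower().split()]
    return [w for w in cleaned if w not in WORDS]
```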
Ok, fair point -- I blame the misleading title :)
I read the point of the article as comparing "spell-checking then (80s) and now", whereas others read it more along the lines of "looking up static English words then and now". Your nickname sounds Japanese, but I assume you're talking about English as well with those 15 minutes.
Maybe it's not fair to cite the work of super heroes, but Peter Norvig wrote a spelling corrector in 21 lines of Python: http://norvig.com/spell-correct.html
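The core trick on that page, condensed: generate every string one edit away from the input and keep the ones that are real words. This sketch omits Norvig's second-edit pass and his frequency-based ranking, so it only illustrates the candidate-generation idea:

```python
def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Prefer the word itself, then any known word one edit away."""
    if word in dictionary:
        return word
    candidates = edits1(word) & dictionary
    # Norvig picks the most frequent candidate; lacking counts, pick any.
    return min(candidates) if candidates else word
```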
It is fair, it is a first step in that direction :-)
That corrector has no context, though, so it will not correct misspelled words that happen to come out as other words (e.g. "Their coming too sea if its reel.").
This is especially important on the web, where pretty much every conceivable word is "correct" (a name of a company or otherwise). False positives are costly, and context disambiguation critical. Let the fun begin.
17 replies →
That'd be a grammar checker, which is a whole 'nother beast. That sentence should pass a spell check just fine.
Spell checking for languages with a high rate of agglutination, where you can't feasibly enumerate every possible form of a stem, is still a major feat. Languages like Turkish, Finnish and Hungarian are prime examples of this.
Oh, those were the times.
What makes working in a finite, very limited set of resources so rewarding is that those limitations turn mere programming into art.
You can't have art unless you constrain yourself somehow. Some people paint with dots only and some people express themselves in line art. If they allowed themselves any imaginable method that is applicable, they wouldn't be doing art. They could just take a photograph of a setting, and that photograph wouldn't say a thing to anyone.
Endless bit-twiddling and struct packing may turn trivial methods into huge optimization-ridden hacks and not get you too far vertically, but given only a few resources, those hacks are required to turn the theoretical approach into a real application that you can actually do useful work with. Those hacks often exhibit ingenious thinking, and many examples of that approach art; the best definitely are art. And any field where ingenious thinking is required will push the whole field forward.
Similarly, for example, using a Python set as a rudimentary spell-checker is fast, easy, and convenient but it's no hack because it requires no hacking. It's like taking that photograph, or using a Ferrari to reverse from your garage out to the street and then driving it back. Which ingenious tricks are you required to accomplish that? None.
The bleeding edge has simply moved, and it must lie somewhere these days, of course. Maybe computer graphics, although it has always demonstrated the capability to lead the bleeding edge, so there's actually no news there. The fact is that the bleeding edge is more scattered. Early on, every computer user could feel and sense the bleeding edge because it revolved around tasks so basic that you could actually see the limits with your own eyes. Similarly, even a newbie programmer would face the same limitations early on and learn how others had dealt with them. Now you can stash gigabytes of data into your heap without realizing you did, and wonder why the computer felt sluggish for a few seconds. Or how would you compare two state-of-the-art photorealistic, interactive, real-time 3D graphics demos? There's so much going on behind the curtains that it's difficult to evaluate the works without extensive domain knowledge of most of the technologies used.
Finding the bleeding edge has become a field in itself.
I understand your general thesis, but your statements about art just seem completely off.
> You can't have art unless you constrain yourself somehow.
> They could just take a photograph of a setting, and that photograph wouldn't say a thing to anyone.
While I realize that art is subjective, I'm very surprised that you would put these conditions around what you consider to be art. Especially since it seems to be a condemnation of a large subset of photography.
Just curious, but would you like to give me a couple of examples of recognized good art that isn't constrained by some rule, method, technique, or approach?
A large subset of photography isn't art. In fact, most everything people create isn't art per se—if it were, there wouldn't be good art nor bad art, just art and lots and lots of it. Spend a few hours on some photo-sharing site and see what people shoot. They're photographs, but rarely art.
But there are grades of art. Look at this search (http://goo.gl/2mLVI): a thousand sunset pictures, while maybe pretty, aren't generally art, and not because it's the same sun in each picture. Most of these pictures have nothing to say. Now, some object lit by the sunset or silhouetted against it has a lot more potential to be art. A carefully crafted study of a sunset in the form of a photograph can be art, but it requires finding certain constraints first, finding a certain angle that makes the photograph a message, and eventually conveying through the lens something that makes the viewer stop for a moment: an idea, a feeling, even a confusion.
1 reply →
I don't think so. Photography as art has never been just pointing and shooting. You have to find the right natural lighting, or the right facial expression, or the right color composition, etc. All those restrictions make photography an art that not just anyone can achieve (at least, not without extensive training and effort).
1 reply →
Interesting. But maybe a better title would be: A Spellchecker Used To Require A Major Feat of Software Engineering.
Some may say something is lost and resources wasted - taken for granted as we now brute force our way through such problems. Surely going backwards?
But now we are, a million times a second, free to disambiguate a word's meaning, check its part of speech, figure out whether it's a named entity or an address, figure out if it is a key topic in the document we are looking at, and write basic sentences. I agree. That is progress.
> Some may say something is lost and resources wasted - taken for granted as we now brute force our way through such problems. Surely going backwards?
I think it is progress: Yes, we can brute force today through the problems that were a feat a decade or two ago. But: Not having to solve those problems frees a lot of time. Time that can be spent on problems that are a major feat today.
It's a shame that solutions like Bob Morris's no dictionary spell checker[1] are left languishing just because we all have fast computers.
Getting better at a problem in one domain can spin off benefits in others.
Spelling is part of language, and language is something that computers are really bad at. Brute force helps a bit with that (auto-correct; siri;) but better understanding would be cool.
[1] (http://www.spellingsociety.org/journals/j20/spellchecking.ph...)
Morris, Robert & Cherry, Lorinda L, 'Computer detection of typographical errors', IEEE Trans Professional Communication, vol. PC-18, no.1, pp54-64, March 1975.
It seems the author of that article doesn't know that spell checking, translating and understanding text are actually major features of fairly new software, too. I don't know how good the spell checkers are for English, but in German they have sucked for years. You can't just let MS Word autocorrect your text, because the spell checker will insert more errors than it removes. So when I want to find out how to spell a word correctly in German, I just use the Google search bar. Why? Because they consider spell checking a hard task TODAY!
Spell checking is not about comparing a list of words to what the user wrote and telling him what didn't match. It's much more about understanding the user's intention and helping him shape that intention into officially recognised grammar and spelling. Example: "then" is a correct word, but in the context of "Google's spell check is better then Word's", "then" is actually wrong. (Google doesn't flag that mistake either, but the first search result actually contains a "than", which is recognised as what you actually meant.)
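One common way to attack exactly this then/than problem is confusion sets scored against n-gram statistics: for each confusable word, ask which alternative the surrounding context makes more likely. A toy sketch with made-up bigram counts (a real checker would estimate these from a large corpus):

```python
# Toy bigram counts, assumed for illustration only.
BIGRAMS = {("better", "than"): 900, ("better", "then"): 30,
           ("and", "then"): 800, ("and", "than"): 5}
CONFUSIONS = {"then": "than", "than": "then"}

def check_context(prev_word, word):
    """Return the variant of a confusable word that better fits the context."""
    alt = CONFUSIONS.get(word)
    if alt is None:
        return word  # not a known confusion pair; leave it alone
    score = lambda w: BIGRAMS.get((prev_word, w), 0)
    return word if score(word) >= score(alt) else alt
```

So "better then" would be corrected to "better than", while "and then" would be left alone, which is precisely the kind of distinction a bare word-list lookup can never make.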
I hope I have made it clear why I think having the best spell checker is still a major feat, and why I think the title really should be revised.
Yes. This whole article struck me as the spell checking equivalent to "I can build Map Reduce in 5 lines of language N" meme that went around a while ago.
"""It seems the author of that article doesn't know that spell checking, translating and understanding text are actually major features of pretty new software, too. """
Actually it seems that everybody reading the article got the same WRONG impression.
The author knows full well what it takes to do an i18n full-featured spell checker.
That is BESIDE the point.
What he says is that doing a basic (lame-ass) spell checker in the 80s used to be a MAJOR undertaking, and now doing EXACTLY THE SAME THING is trivial.
His point is not about spell-checking.
It's about modern OS, language, CPU, HD and memory conveniences, vs what one had to deal with in the olden days.
This relates to Fred Brooks's 'No Silver Bullet': the 'accidental complexity' of coping with hardware constraints has given way to the 'essential complexity' of writing an algorithm to check someone's spelling.
I know Bloom Filters came in handy for this sort of work but I'd love to see some other data structures and algorithms that were developed in the 80s to deal with limited memory.
You might like 'How to Fit a Large Program Into a Small Machine' (http://www.csd.uwo.ca/Infocom/Articles/small.html), where some of the founders of Infocom talk about the approaches they used to fit Zork (which required a massive 1MB of memory) onto microcomputer systems which had 32K and a floppy disk drive.
See also the Digital Antiquarian's fascinating (and reasonably technical) blog on the origins of Infocom: http://www.filfre.net/tag/infocom/
See the 1988 paper "The world's fastest Scrabble program" by Andrew W. Appel (Princeton) and Guy J. Jacobson (Carnegie-Mellon).
They used a data structure they dubbed a DAWG (directed acyclic word graph): a trie for common start patterns that also collapsed together common endings (every "tion"-ending word ended in the same portion of the graph). As I recall, they used 5 bits to store ASCII uppercase characters and either 11 or 19 bits to address nodes in the graph, leading to a data structure that compressed words as well as any LZW-type compression, with the advantage that it remained directly readable for word extraction.
While the paper was published in 1988, the program was written and being used much earlier in '82 or '83.
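A DAWG is essentially a trie with shared suffixes. A plain trie already shares prefixes automatically; the sketch below shows that half (the suffix-merging step that turns it into a DAWG, and the bit-packing described above, are omitted):

```python
def build_trie(words):
    """Nested-dict trie; the '$' key marks end-of-word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # shared prefixes reuse nodes
        node["$"] = True
    return root

def contains(trie, word):
    """Walk the trie one character at a time; membership is O(len(word))."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

In "nation", "station" and "ration" a DAWG would additionally merge the common "ation" tails into one shared sub-graph, which is where the dramatic compression in the paper comes from.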
http://wiki.cs.pdx.edu/cs542-spring2011/papers/appel-scrabbl...
Jon Bentley's Progamming Pearls books are full of such insights into designing incredibly efficient data structures and algorithms.
Yes, and "Programming Pearls" has a nice chapter (13.8) on how Doug McIlroy fit the spell dictionary into less than 64K ( http://code.google.com/p/unix-spell/).
That site also has a paper on the development of spell (http://unix-spell.googlecode.com/svn/trunk/McIlroy_spell_198...).
In the 80's I got the gist of spellchecker algorithms.
An Apple II spellchecker that I used a few times seemed to spell-check by brute force, with the dictionary filling a separate 143K floppy disk. It was a somewhat slow, batch-like, one-document-at-a-time operation, but it was good enough to be useful.
As a programmer in the 80's, I saw a spellchecker that used a small hash table to tell whether a word was misspelled or not. The hash table could be made even smaller if we limited the dictionary to about 16000 of the more common words. If I remember correctly, the smallest hash table was generated as an array of 16-bit integers in the source code before compiling, and its small footprint was kept in memory by the built program. Once a word was identified as a misspelling, using a very fast hash look-up, suggestions could be pulled from the actual dictionary on disk. Using the hash, there was the possibility of false positives: misspellings passing as correct words. Maybe this was a Bloom filter.
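That description does match a Bloom filter: k hash functions each set a bit in a fixed-size array, lookups check those same bits, and the structure can report false positives (a misspelling passing as correct) but never false negatives. A minimal sketch, deriving the k positions from slices of one strong hash as a common shortcut:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, hashes=4):
        self.size, self.k = size_bits, hashes
        self.bits = bytearray(size_bits // 8)  # all bits start cleared

    def _positions(self, word):
        # k positions taken from 4-byte slices of a single SHA-256 digest.
        digest = hashlib.sha256(word.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))
```

With 16K words and a table of 16-bit integers, sizing the bit array and k to hit an acceptable false-positive rate is exactly the kind of tuning the comment describes.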
Morris, Robert & Cherry, Lorinda L, 'Computer detection of typographical errors', IEEE Trans Professional Communication, vol. PC-18, no.1, pp54-64, March 1975.
They do away with a dictionary.
The VP-tree (vantage-point tree), developed in the 90s.
Progress for computing, absolutely. Progress for software engineering? Questionable. Are the clever solutions applied in legacy spellcheckers obsolete simply because sufficient brute force is now available on everyday machines to solve everyday problems?
The problem with a lot of clever solutions is that they often obfuscate the code. This makes it harder for someone to figure out exactly what and how you are doing something. The clever solutions were necessary back then because cycles, memory, and storage were at a premium. Today they are mostly premature or unnecessary optimizations.
As an example, a few months ago a young developer I worked with implemented a set of attributes on a table as a bit field. He had just read an article about them and did it because it sounded cool. It made running reports against the data and doing imports via plain SQL impossible without writing a bunch of extra functions. His clever solution, which saved about $0.0001 worth of disk, ended up costing several hours of developer time because he didn't just use a join table and a foreign key.
Yes. Engineering is constraint management. You're constantly trading off one thing for another: speed, flexibility, maintenance, cost, durability, etc. When one thing fails to be a relevant constraint, it's ok to focus on the other things that are of more immediate concern.
There are always questions on the edge of computational complexity. In 30 years: "Indexing all of the individual web pages in the world used to be a tough challenge. Network speeds were in the hundreds of megabits, and..."
I would be surprised if that analogy holds up. I have heard the Google guys state that the growth of the web easily outpaces advances in computing and bandwidth. The English language, not so much...
1 reply →
> In 30 years: "Indexing all of the individual web pages in the world used to be a tough challenge."
Eh, that needs qualification to make sense. Just crawling and trivially tokenizing the web counts as "indexing all the individual web pages" but that's never been such a monstrous task.
Having results fresh to the day/hour/minute is a better example of something that will probably be looked at as child's play, even though it's a very new, very computationally expensive development in search.
Designing a high quality and compact spell checker for agglutinative languages like Turkish or Hungarian is definitely a non trivial and challenging problem.
Now, if they could only fix OCR technology! I'm getting tired of "the" being scanned as "die".
Just run it through google translate, German -> English. That'll fix "die" into "the", but it'll make some other things strange I bet.
It's not that easy today either. Loading a list of words into a hashtable will work for, say, Chinese (pinyin), but most languages have declensions and conjugations: you want to validate both "cat" and "cats", and "walk", "walking" and "walked", and the word list normally wouldn't contain those. (I haven't checked the English ones, but it certainly doesn't hold for languages with more complex inflection.)
Yes, you don't have the memory complications that were really hard, but you still need to think: pick a proper data structure (a trie, for example) and fill it with all forms of the words.
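A naive sketch of that "fill it with all forms" step for English-style regular inflection. This is deliberately oversimplified (only one spelling rule, no irregulars); real morphology, especially for agglutinative languages, needs a rule engine along the lines of Hunspell's affix files:

```python
def expand(stem, suffixes=("", "s", "ed", "ing")):
    """Naively attach regular suffixes to a stem (handles only final-e drop)."""
    forms = set()
    for suf in suffixes:
        # "smile" + "ed" -> "smiled", not "smileed"
        base = stem[:-1] if suf in ("ed", "ing") and stem.endswith("e") else stem
        forms.add(base + suf)
    return forms

# Build the lookup set containing every generated form.
dictionary = set()
for stem in ["walk", "smile"]:
    dictionary |= expand(stem)
```

For a language like Turkish or Finnish this enumeration explodes combinatorially, which is why those spell checkers analyze words into stem plus affixes at lookup time instead of expanding everything up front.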
How long was it until a grammar checker was able to be implemented with these space constraints?
I had a decent spell checker on my Apple //c - 128K RAM (64K commonly usable) and had to fit on 140KB floppy. While slow it worked better than about 10 years of MS Word.