Comment by hiAndrewQuinn
7 hours ago
>This would normally not be much cause for concern; of course a dictionary program will include code to talk to dictionary-providing web sites.
Hey, an area I finally know something about. It depends on what you're trying to do.
The slimmed down version of a Finnish dictionary I provide in `tsk` [1] weighs in at around 30 MB, for about 250,000 Finnish words. It's small enough that I embed the whole dictionary directly into the binary and reconstruct the prefix search on the fly every time the user starts the app.
However, the much larger database which contains things like lemmatization and etymology information easily balloons up to many, many gigabytes in size. My problem domain is providing Truly Instant Lookup, keystroke by keystroke, so I can't really get around this level of memoization. The work to figure all this out was sufficient that I decided to make future versions a paid product instead [2].
Most other use cases would just call out to a server, because it's silly to think most people are going to download a giant database for that use case alone. A hybrid approach could also make a lot of sense, eg cache the most common 10,000 words locally and call out for the next 1.5 million, which are statistically extremely rare.
[1]: https://github.com/hiandrewquinn/tsk
[2]: https://taskusanakirja.com/ (offline for now until I get Digicert to certify my downloads wholesome for Windows resale)
No comments yet
Contribute on Hacker News ↗