Comment by ntoshev
16 years ago
I have a programming problem, and I'm pretty sure I'm aware of the basic components of the solution... it's just that I get swamped in complexity pretty quickly when I try out different combinations.
I know this is a very meta question, but this isn't the only case where it happens. I'll solve it eventually, but I'd like to ask about common strategies.
Edit: Some of my strategies are: to absorb more information about similar problems while I put solving it on the back burner, to try to describe the problem clearly, and to bounce the problem off other people.
-------------------------
Okay, here is the specific question: I have a map of categories of things (a few words each) and descriptions of what these categories mean (longer text). I also have another list of category names that sometimes uses different words or abbreviations, and I need to match the second list to the first one precisely.
I know all about tf-idf, vector-space similarity between a document and a term, and WordNet as a source of semantic relations between words, and it's OK to have a human map some of these. The problem is I don't know when a matching score is good enough and when I need to fall back to a human.
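One way to frame the "when do I fall back to a human" question is as two thresholds around the similarity score: auto-accept above one cutoff, discard below another, and queue everything in between for review. Here is a minimal stdlib-only sketch of that idea; the `accept=0.6` and `reject=0.3` cutoffs are illustrative assumptions, not values derived from any data — in practice you'd tune them against a hand-labeled sample.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple tf-idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match(query_vec, category_vecs, accept=0.6, reject=0.3):
    """Return (best_index, score, decision) with a human-review band
    between the reject and accept thresholds."""
    scores = [cosine(query_vec, c) for c in category_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] >= accept:
        decision = "auto-accept"
    elif scores[best] <= reject:
        decision = "no match"
    else:
        decision = "human review"
    return best, scores[best], decision
```

The point of the middle band is that precision stays high on the automatic matches while the review queue shrinks as the thresholds are tightened against labeled examples.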
Use Levenshtein distance - http://en.wikipedia.org/wiki/Levenshtein_distance
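For reference, a straightforward dynamic-programming implementation of Levenshtein distance (it scores near-miss spellings well, though on its own it won't handle abbreviations, where many characters are legitimately dropped):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A common way to turn this into a score for thresholding is the normalized similarity `1 - distance / max(len(a), len(b))`.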
Thanks, my boolean approximation for abbreviations so far is:
Actually, what I'm doing now seems to be working; so far I can't see any pattern in the things my algorithm can't match by itself.
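For illustration, here is one typical shape a boolean abbreviation test can take — this is my own sketch of a common heuristic, not the poster's actual rule, which isn't shown. It accepts a candidate if it is an initialism of the phrase or if its letters appear in order within the phrase:

```python
def could_abbreviate(abbr, phrase):
    """Loose boolean test: is `abbr` a plausible abbreviation of `phrase`?

    Accepts initialisms ('vp' -> 'vice president') and in-order letter
    subsequences ('mgmt' -> 'management'). Deliberately permissive:
    a subsequence check admits false positives, so it is best used as
    a pre-filter before a stricter score or a human check.
    """
    abbr = abbr.lower().rstrip(".")
    words = phrase.lower().split()
    # Case 1: initialism built from the first letter of each word.
    if abbr == "".join(w[0] for w in words):
        return True
    # Case 2: abbr's letters occur in order in the phrase.
    chars = iter(phrase.lower())
    return all(c in chars for c in abbr)
```

The `c in chars` trick consumes the iterator, so each letter must be found strictly after the previous one.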
Also thanks to ramanujan, who deleted his comment for some reason. Besides pointing out org-mode, which I want to check out, his suggestion reminded me that I'm trying to deal with my dataset incrementally when a batch mode might work better.