Comment by ntoshev

16 years ago

I have a programming problem, and I'm pretty sure I know the basic components of the solution... it's just that I get swamped in complexity pretty quickly when I try out different combinations.

I know this is a very meta question, but this isn't the only time it's happened. I'll solve this one eventually, but I'd like to ask about common strategies.

Edit: Some of my strategies so far: absorbing more information about similar problems while putting this one on the back burner, trying to describe the problem clearly, and bouncing it off other people.

-------------------------

Okay, here is the specific question: I have a map from categories of things (a few words each) to descriptions of what these categories mean (longer text). I also have another list of category names that sometimes uses different words or abbreviations, and I need to match the second list to the first one precisely.

I know all about tf-idf, vector space similarity between a document and a term, and WordNet as a source of semantic relations between words, and it's OK to have a human map some of these. The problem is that I don't know when a matching score is good enough and when I need to fall back to a human.
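
As a minimal sketch of that cutoff-plus-fallback idea (assuming scikit-learn is available; the function name and the 0.5 threshold are placeholders you'd tune on hand-labeled pairs):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def match_or_defer(descriptions, name, threshold=0.5):
        # Vectorize the category descriptions and the incoming name together
        vec = TfidfVectorizer()
        mat = vec.fit_transform(list(descriptions) + [name])
        # Cosine similarity of the name against every description
        scores = cosine_similarity(mat[-1], mat[:-1])[0]
        best = scores.argmax()
        # Below the hand-tuned cutoff, defer to a human instead of guessing
        return best if scores[best] >= threshold else None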

Use Levenshtein distance - http://en.wikipedia.org/wiki/Levenshtein_distance
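
For reference, a minimal pure-Python sketch of the standard dynamic-programming recurrence behind it (the function name is illustrative):

    def levenshtein(a, b):
        # prev[j] holds the edit distance between the prefix of a seen so far and b[:j]
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

For example, levenshtein('kitten', 'sitting') == 3.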

  • Thanks, my boolean approximation for abbreviations so far is:

      import re

      def abbr(short, full):
          # True if the letters of `short` appear in order in `full` (escaped to be regex-safe)
          return re.match(''.join(re.escape(c) + '.*' for c in short), full)
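
    For example, abbr('cfg', 'configuration') matches while abbr('cfg', 'category') doesn't; note that re.match anchors at the start, so the first letter of the abbreviation has to be the first letter of the full name.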
    

    Actually, what I'm doing now seems to be working; so far I can't see any pattern in the things my algorithm can't match by itself.

    Also thanks to ramanujan, who deleted his comment for some reason. Besides pointing out orgmode, which I want to check out, his suggestion reminded me that I've been processing my dataset incrementally when a batch mode might work better.
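
    One reading of that batch idea, as a hedged sketch: assuming scipy, a precomputed score matrix, and that each category should be matched at most once (all names here are illustrative), pick the globally best one-to-one pairing instead of committing to each match as it arrives:

      import numpy as np
      from scipy.optimize import linear_sum_assignment

      def batch_match(scores):
          # scores[i][j]: similarity between new name i and known category j
          # Picks the one-to-one pairing that maximizes total similarity
          rows, cols = linear_sum_assignment(np.asarray(scores), maximize=True)
          return list(zip(rows, cols))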