Comment by copypaper

4 months ago

I originally settled on doing this, but the problem is that you have to re-calculate everything if you ever add/remove a category. If your categories will always be static, that will work fine. But it's more than likely you'll eventually have to add another category down the line.

If your categories are dynamic, the way OP handles it will be much cheaper as the number of tweets (or customer service calls in your case) grows, as long as the cache hit rate is >0%. Each tweet gets its own label, e.g. "joke_about_bad_technology_choices". Each of these labels then gets put into a category, e.g. "tech_jokes". If you add/remove a category you still have to re-calculate, but only the label-to-category mapping, not every single tweet. Since similar tweets can share the same label, you end up with fewer labels than tweets. As the number of distinct labels approaches the asymptotic ceiling mentioned in OP's post, the cost of re-embedding labels into categories levels off at a ceiling too.
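To make that concrete, here is a rough Python sketch of the two-level idea. It is not OP's actual code: the class name, the `label_fn` hook, and the toy `prefix_overlap` similarity (a stand-in for real embeddings) are all made up for illustration. The point is only that changing the category list re-maps the distinct labels and never touches the per-tweet labeling step.

```python
def prefix_overlap(label: str, category: str) -> int:
    """Toy stand-in for embedding similarity (an assumption, not a real model):
    counts token pairs where one token is a prefix of the other."""
    return sum(
        1
        for a in label.split("_")
        for b in category.split("_")
        if a.startswith(b) or b.startswith(a)
    )


class TwoLevelCategorizer:
    """Hypothetical helper: item -> label (expensive), label -> category (cheap)."""

    def __init__(self, categories, label_fn, similarity=prefix_overlap):
        self.categories = list(categories)
        self.label_fn = label_fn          # expensive per-item step, e.g. an LLM call
        self.similarity = similarity
        self.item_to_label = {}           # computed once per item
        self.label_to_category = {}       # recomputed when categories change
        self.label_calls = 0
        self.mapping_calls = 0

    def _map_label(self, label):
        self.mapping_calls += 1
        return max(self.categories, key=lambda c: self.similarity(label, c))

    def categorize(self, item):
        if item not in self.item_to_label:
            self.label_calls += 1
            self.item_to_label[item] = self.label_fn(item)
        label = self.item_to_label[item]
        if label not in self.label_to_category:
            self.label_to_category[label] = self._map_label(label)
        return self.label_to_category[label]

    def set_categories(self, categories):
        """Re-map only the distinct labels; items are not re-labeled."""
        self.categories = list(categories)
        self.label_to_category = {
            label: self._map_label(label) for label in self.label_to_category
        }


# Hard-coded labels standing in for whatever the labeling step would produce.
fake_labels = {
    "tweet A": "joke_about_bad_technology_choices",
    "tweet B": "joke_about_bad_technology_choices",
    "tweet C": "complaint_about_slow_shipping",
    "tweet D": "complaint_about_rude_support",
}

cat = TwoLevelCategorizer(["tech_jokes", "shipping_complaints"], fake_labels.__getitem__)
before = {tweet: cat.categorize(tweet) for tweet in fake_labels}

# Adding a category re-maps the 3 distinct labels, not the 4 tweets.
cat.set_categories(["tech_jokes", "shipping_complaints", "support_complaints"])
after = {tweet: cat.categorize(tweet) for tweet in fake_labels}
```

With real data the gap between "distinct labels" and "total tweets" is what matters: the re-map cost scales with the former, which is the number that flattens out.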

If the number of items you're categorizing is a couple thousand at most and you rarely add/remove categories, it's probably not worth the complexity. But in my case (and OP's) it's worth it, since the number of items keeps growing without bound.
