Comment by michael-0acf4

4 months ago

I think one-hot/multi-hot encoding might have worked if lexicographic inconsistency is the issue.

Labeling can be done by asking an LLM to label each tweet from a predefined set of labels; the order doesn't matter. We can generate these labels either manually or by sampling a small subset of tweets and asking the LLM to tag them (we could use a word dictionary too). From these labels, we form a vector of length |labels| whose entries are sorted alphabetically (the ordering is arbitrary but must be consistent).
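A minimal sketch of that fixed ordering, assuming a hypothetical label set (in practice it would come from manual curation or from tagging a sampled subset):

```python
# Hypothetical label set for illustration.
labels = ["sports", "politics", "tech", "humor"]

# Sort once and reuse everywhere so every vector shares the same ordering,
# which avoids the lexicographic inconsistency mentioned above.
label_index = {label: i for i, label in enumerate(sorted(labels))}
# → {"humor": 0, "politics": 1, "sports": 2, "tech": 3}
```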

To populate it, we ask the LLM to tag each tweet, restricting its JSON output to an enum composed of our labels. From that, we can form hot vectors, effectively embedding and tagging each tweet at the same time.
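Something like this, where the tag list stands in for what the enum-restricted LLM call would return:

```python
def multi_hot(tags, label_index):
    """Turn the LLM's tag list for one tweet into a multi-hot vector."""
    vec = [0] * len(label_index)
    for tag in tags:
        vec[label_index[tag]] = 1
    return vec

# Same hypothetical ordering as above.
label_index = {"humor": 0, "politics": 1, "sports": 2, "tech": 3}

# Pretend the LLM returned {"tags": ["tech", "humor"]} for this tweet.
multi_hot(["tech", "humor"], label_index)  # → [1, 0, 0, 1]
```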

We get tags for "free", and as a byproduct we also get a vector embedding that can be further refined with techniques such as PCA, then you can cosim as usual. The main cost is asking the LLM to label each tweet against the predefined set of tags; you can micro-optimize that by batching prompts ig.
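The "cosim as usual" step on those multi-hot vectors is just standard cosine similarity, e.g.:

```python
import math

def cosim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Two tweets sharing the hypothetical "humor" and "tech" tags from above.
cosim([1, 0, 0, 1], [1, 0, 1, 1])  # → 2/sqrt(6) ≈ 0.816
```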