Comment by tetris11

1 day ago

UMAP or TSNE would be nice, even if PCA already shows nice separation.

Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis

Great points — thank you. PCA gave me surprisingly clean separation early on, so I stuck with it for the initial run. But you’re right — throwing UMAP or t-SNE at it would definitely give a nonlinear perspective that could catch subtler patterns (or failure cases).

And yes to the cross-cluster reference idea — I didn’t build a similarity matrix between clusters, but now that you’ve said it, it feels like an obvious next step to test how much signal is really being captured.

Might spin those up as a follow-up. Appreciate the thoughtful nudge.

Do you have examples of how this reference mapping is performed? I'm interested in this for embeddings in a different modality, but don't have as much experience on the NLP side of things

  • Nothing concrete, but you essentially perform shared nearest neighbours using anchor points in each cluster you wish to map to. These anchors define correction vectors you can then use to project one dataset onto another.
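To sketch the idea above: one common concrete form of it is mutual-nearest-neighbour anchoring, where points that are each other's nearest neighbour across the two datasets serve as anchors, and the mean displacement between anchor pairs becomes the correction vector. This is a minimal toy sketch under that assumption (single cluster, numpy only; the function names and setup are illustrative, not a reference implementation):

```python
import numpy as np

def mutual_nn_anchors(A, B):
    """Find mutual-nearest-neighbour pairs between datasets A and B.
    A pair (i, j) means A[i]'s nearest point in B is B[j], and vice versa."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    nn_ab = d.argmin(axis=1)  # for each A point, index of nearest B point
    nn_ba = d.argmin(axis=0)  # for each B point, index of nearest A point
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def project_b_onto_a(A, B):
    """Shift B into A's frame using the mean correction vector over anchors."""
    idx_a, idx_b = map(list, zip(*mutual_nn_anchors(A, B)))
    correction = (A[idx_a] - B[idx_b]).mean(axis=0)
    return B + correction

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))             # reference cluster
B = rng.normal(size=(50, 2)) + [5.0, 5.0]  # same cluster, shifted by a batch offset
B_corrected = project_b_onto_a(A, B)     # B moved toward A's frame
```

In practice you'd compute one correction vector per cluster (or per anchor neighbourhood) rather than a single global shift, but the anchor-then-correct structure is the same.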

When I get nice separation with PCA, I personally tend to eschew UMAP, since the relative distance of all the points to one another is easier to interpret. I avoid t-SNE at all costs, because distances in those plots are pretty much meaningless.

(Before I get yelled at, this isn't prescriptive; it's a personal preference.)
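To make that preference concrete, here's a toy sketch (synthetic clusters, numpy-only PCA via SVD; the cluster layout and numbers are illustrative assumptions): when the cluster structure lives in a low-dimensional subspace, the linear PCA projection preserves the relative centroid distances, which is exactly the property a t-SNE plot doesn't guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three clusters in 10-D whose centroid distances have a known ordering:
# d(A,B) = 4  <  d(A,C) = 9  <  d(B,C) = sqrt(97) ~ 9.85
centers = np.zeros((3, 10))
centers[1, 0] = 4.0
centers[2, 1] = 9.0
X = np.vstack([c + rng.normal(scale=0.3, size=(100, 10)) for c in centers])
labels = np.repeat([0, 1, 2], 100)

# PCA via SVD on centered data; keep the top two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T

# Cluster centroids in the 2-D projection keep the same distance ordering,
# which is what makes relative distances in a PCA plot interpretable.
cent = np.array([proj[labels == k].mean(axis=0) for k in range(3)])
```

Because PCA is a linear map, distances can only shrink by what falls outside the kept components; t-SNE's objective preserves local neighbourhoods and makes no such promise about inter-cluster geometry.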

  • PCA having nice separation is extremely uncommon unless your data is unusually clean or has obvious patterns. Even for the comically easy MNIST dataset, the PCA representation doesn't separate nicely: https://github.com/lmcinnes/umap_paper_notebooks/blob/master...

    • "extremely uncommon" is very much not my experience when dealing with well-trained embeddings.

      I'd add that just because you can achieve separability from a method, the resulting visualization may not be super informative. The distances between clusters that appear in t-SNE-projected space often have nothing to do with their distances in the latent space, for example. So while you get nice separate clusters, it comes at the cost of the projected space greatly distorting/hiding the relationship between points across clusters.