Comment by EForEndeavour

4 months ago

From my limited experience trying exactly this, it gets you 80% of the way there, then devolves into an infuriating and time-wasting exercise in endless iteration and prompting to sweep clustering parameters and labeling details to nail the remaining 20% needed for acceptance by downstream "customers" (i.e., nontechnical business people).

If your end goal is to show an audience of nontechnical stakeholders an overview of your dataset in a static medium (like a slide), I would suggest you do the cluster labeling yourself, with the help of interactive tooling to make the semantic cluster structure explorable. One option is to throw the dataset into Apple's recently published and open-sourced Embedding Atlas (https://github.com/apple/embedding-atlas), poke around in the semantic space, take a screenshot of the cluster viz, and manually annotate the 5-10 most interesting clusters right in Google Slides or PowerPoint. If you need more control over the embedding and projection steps (and you have a bit more time), write your own embedding and projection code, then use something like Plotly to build a quick interactive viz just for yourself; drop a screenshot into a slide and annotate it. Feels super dumb, but it reliably produces human-friendly output you can actually present confidently as part of your data story, and then you can get on with your life.
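For the roll-your-own route, here is a minimal sketch of what "embed, project, hover around" might look like. Everything in it is an assumption for illustration, not anyone's actual pipeline: `embed` is a toy TF-IDF stand-in for a real embedding model, `project_2d` is plain PCA via power iteration standing in for UMAP or t-SNE, and `write_plot` writes a standalone HTML page that calls plotly.js's `Plotly.newPlot` from an assumed CDN URL. The function names and sample texts are hypothetical.

```python
import json
import math
import re
from collections import Counter


def embed(texts):
    """Toy TF-IDF bag-of-words vectors (stand-in for a real embedding model)."""
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1 for w in vocab}
    vecs = []
    for d in docs:
        v = [d[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vecs.append([x / norm for x in v])  # unit length
    return vecs


def project_2d(vecs, iters=100):
    """PCA down to 2D using power iteration with deflation."""
    n, dim = len(vecs), len(vecs[0])
    mean = [sum(v[j] for v in vecs) / n for j in range(dim)]
    X = [[v[j] - mean[j] for j in range(dim)] for v in vecs]  # centered data
    components = []
    for k in range(2):
        w = [1.0 / (j + 1 + k) for j in range(dim)]  # deterministic start vector
        for _ in range(iters):
            # w <- X^T (X w), then strip earlier components and renormalize
            scores = [sum(a * b for a, b in zip(row, w)) for row in X]
            w = [sum(scores[i] * X[i][j] for i in range(n)) for j in range(dim)]
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w)) or 1.0
            w = [a / norm for a in w]
        components.append(w)
    return [[sum(a * b for a, b in zip(row, c)) for c in components] for row in X]


# Assumed CDN URL for plotly.js; pin whichever version you actually use.
PAGE = """<!doctype html>
<meta charset="utf-8">
<div id="plot" style="width:100%;height:95vh"></div>
<script src="https://cdn.plot.ly/plotly-2.35.2.min.js"></script>
<script>Plotly.newPlot("plot", __DATA__, __LAYOUT__);</script>
"""


def write_plot(texts, coords, path="embedding_map.html"):
    """Write a standalone scatter plot; hovering a point shows its text."""
    trace = {
        "type": "scatter",
        "mode": "markers",
        "x": [c[0] for c in coords],
        "y": [c[1] for c in coords],
        "text": texts,
        "hoverinfo": "text",
    }
    layout = {"title": {"text": "Embedding map (hover to read the text)"}}

    def to_js(obj):
        # Escape "</" so text content cannot close the <script> tag early.
        return json.dumps(obj).replace("</", "<\\/")

    page = PAGE.replace("__DATA__", to_js([trace])).replace("__LAYOUT__", to_js(layout))
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)
    return path


if __name__ == "__main__":
    sample = [  # hypothetical support-ticket snippets
        "refund for double billing",
        "billing refund request",
        "cannot reset password",
        "password reset link broken",
    ]
    print(write_plot(sample, project_2d(embed(sample))))
```

Open the resulting HTML file in a browser, hover over the blobs to see what is in them, then screenshot and annotate as described above. For a real dataset you would likely swap the toy `embed` for an actual embedding model and the PCA for a nonlinear projection; the plotting step stays the same.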