Comment by anArbitraryOne

4 days ago

Just want to say how great I am for calling this out a few months ago https://news.ycombinator.com/context?id=41470605

It's nice to hear that! And judging from this thread, it is not just the two of us; otherwise, the title wouldn't have resonated with the Hacker News community.

This blog post stemmed from my frustration that people use cosine distance without a second thought. In virtually all tutorials on vector databases, cosine distance is treated as if it were some obvious ground truth.

When questioned about cosine similarity, even seasoned data scientists will start talking about "the curse of dimensionality" or some geometric interpretation but forget that (more often than not) they are working with a hack.

You called it! But it is a pattern as old as the hills in the software industry. "Just add an index". "Put it in the cloud". "Do sprints". One size fits all!

That was a helpful list, in your second comment downthread. What are your top 3 metrics that perform the best on the greatest number of those features that make cosine distance perform poorly?

  • Good question. Unfortunately, I'm just a keyboard warrior asshole who bad-mouths things without offering solutions.