Comment by olooney

1 day ago

I found a neat way to do high-quality "semantic soft joins" using embedding vectors[1] and the Hungarian algorithm[2], and I'm turning it into an open-source Python package:

https://github.com/olooney/jellyjoin

It hits a sweet spot by being easier to use than record linkage[3][4] while still giving really good matches, so I think there's something there that might gain traction. (A rough sketch of the core idea follows the links below.)

[1]: https://platform.openai.com/docs/guides/embeddings

[2]: https://en.wikipedia.org/wiki/Hungarian_algorithm

[3]: https://en.wikipedia.org/wiki/Record_linkage

[4]: https://recordlinkage.readthedocs.io/en/latest/
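
Here's that rough sketch. To be clear, this is not jellyjoin's actual API, just the two building blocks[1][2] wired together by hand; the model name and example strings are illustrative:

    # Embed both key columns, build a similarity matrix, and let the
    # Hungarian algorithm pick the globally optimal one-to-one matching.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(texts: list[str]) -> np.ndarray:
        response = client.embeddings.create(
            model="text-embedding-3-small", input=texts
        )
        return np.array([item.embedding for item in response.data])

    left = ["Intl. Business Machines", "Alphabet Inc.", "Microsoft Corp."]
    right = ["Google", "IBM", "Microsoft"]

    A, B = embed(left), embed(right)

    # OpenAI embeddings are unit-length, so the dot product of every
    # left/right pair is their cosine similarity.
    similarity = A @ B.T

    # linear_sum_assignment minimizes cost, so negate the similarities
    # to get the maximum-weight matching (the Hungarian algorithm).
    rows, cols = linear_sum_assignment(-similarity)
    for i, j in zip(rows, cols):
        print(f"{left[i]!r} -> {right[j]!r} (score={similarity[i, j]:.3f})")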

I love this as someone who used to work on max-weight matchings and now works on LLMs :)

Very neat. As a heavy user of recordlinkage, I'll definitely keep this on my radar.

Cool project!

I see you've saved a spot to show how to use it with an alternative embedding model. It would be nice to be able to use the library without an OpenAI API key. It might even make sense to vendor a basic open-source model in the package so it works out of the box with no remote dependencies.

  • Yes, I'm planning out-of-the-box support for Nomic[1], which can run in-process, and Ollama, which runs as a local server and supports many free embedding models[2]. (Rough sketch below, after the links.)

    [1]: https://www.nomic.ai/blog/posts/nomic-embed-text-v1

    [2]: https://ollama.com/search?c=embedding
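
    Roughly, the Ollama route could look like the function below. The endpoint and payload are Ollama's documented embeddings API; how it would plug into jellyjoin is still an open design question, so treat this as a sketch:

        # Embed texts via a locally running Ollama server, so no remote
        # API key is needed. nomic-embed-text is one free embedding
        # model that Ollama can serve.
        import numpy as np
        import requests

        OLLAMA_URL = "http://localhost:11434/api/embeddings"

        def ollama_embed(texts: list[str],
                         model: str = "nomic-embed-text") -> np.ndarray:
            vectors = []
            for text in texts:
                response = requests.post(
                    OLLAMA_URL,
                    json={"model": model, "prompt": text},
                    timeout=60,
                )
                response.raise_for_status()
                vectors.append(response.json()["embedding"])
            embeddings = np.array(vectors)
            # Normalize rows so dot products are cosine similarities.
            return embeddings / np.linalg.norm(embeddings, axis=1,
                                               keepdims=True)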

    • Project is super cool.

      If you're adding more LLM integration, a cool feature might be sending the results of allow_many="left" off to an LLM completions API that supports structured outputs. E.g., imagine N_left=1e5 and N_right=1e5, but they're different datasets. You could use jellyjoin to identify the top ~5 candidates in right for each left record, reducing the candidate matches from 1e10 to 5e5, then ship those 5e5 off to an LLM for final scoring/matching.
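
      Concretely, the second stage could look something like this. (The shortlist shape coming out of jellyjoin is my guess; the structured-output call is the standard OpenAI pattern, with the model name as a placeholder.)

          # Ask an LLM to pick the best candidate (or none) for one
          # left-hand record, using structured outputs so the answer
          # comes back as a typed object instead of free text.
          from openai import OpenAI
          from pydantic import BaseModel

          client = OpenAI()

          class MatchDecision(BaseModel):
              best_match: str    # one of the candidates, or "" for none
              confidence: float  # self-reported confidence in [0, 1]

          def rerank(record: str, candidates: list[str]) -> MatchDecision:
              completion = client.beta.chat.completions.parse(
                  model="gpt-4o-mini",  # placeholder model name
                  messages=[
                      {"role": "system",
                       "content": "Pick the candidate that refers to the "
                                  "same entity as the record, or return "
                                  "an empty string if none match."},
                      {"role": "user",
                       "content": f"Record: {record}\n"
                                  f"Candidates: {candidates}"},
                  ],
                  response_format=MatchDecision,
              )
              return completion.choices[0].message.parsed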