
Comment by nerdponx

4 months ago

How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?

Modern embedding models (particularly those with context windows of 2048+ tokens) let you YOLO and just plop the entire text blob in, and you'll still get meaningful vectors.
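
For example, a minimal sketch with sentence-transformers; the model name is just one long-context option I'm assuming here, not a specific recommendation:

```python
from sentence_transformers import SentenceTransformer

# Assumption: any embedding model with a long context window works here;
# BAAI/bge-m3 (~8k-token context) is one example that loads this way.
model = SentenceTransformer("BAAI/bge-m3")

# Embed the whole blob in one call -- no chunking, no token-level pooling.
doc = open("review.txt").read()
vec = model.encode(doc, normalize_embeddings=True)
print(vec.shape)  # one dense vector for the entire document
```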

Formatting the input text with a consistent schema is optional but recommended; it makes comparisons between vectors more apples-to-apples.
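
Something like this, where the template and field names are purely illustrative:

```python
# Render every record through one fixed template so each vector encodes
# the same fields in the same order. Field names here are made up.
records = [
    {"title": "Great battery", "category": "electronics", "body": "Lasts two days."},
    {"title": "Runs small",    "category": "apparel",     "body": "Order a size up."},
]

TEMPLATE = "title: {title}\ncategory: {category}\nbody: {body}"
texts = [TEMPLATE.format(**r) for r in records]

# Reusing the `model` from the sketch above.
vecs = model.encode(texts, normalize_embeddings=True)
print(vecs.shape)  # (2, embedding_dim)
```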