Comment by nerdponx
4 months ago
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?
Sentence embedding models like all-MiniLM-L6-v2 [1] or bge-m3 [2] are the usual choice.
[1] https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
[2] https://huggingface.co/BAAI/bge-m3
In my recent project I used OpenAI's embedding model for this because of its convenient API and low cost.
Modern embedding models (particularly those with context windows of 2048+ tokens) let you YOLO and just plop the entire text blob in, and you still get meaningful vectors.
Formatting the input text with a consistent schema is optional, but recommended for better comparisons between vectors.
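A consistent schema just means every document is rendered the same way before embedding. A hypothetical sketch (the field names and helper are made up for illustration):

```python
def format_record(record: dict) -> str:
    """Render a record as a fixed "key: value" block so every
    document fed to the embedding model has the same shape.
    Missing fields are kept as empty values rather than dropped,
    so the layout never varies between documents."""
    fields = ["title", "author", "body"]
    return "\n".join(f"{k}: {record.get(k, '')}" for k in fields)

doc = {"title": "Intro to embeddings", "body": "Vectors represent meaning."}
text = format_record(doc)
```

The embedding model then sees the same field order and labels for every document, which makes the resulting vectors more directly comparable.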
Sentence embedding models are great for this type of thing.
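For contrast, the baseline the original question asks about (summing or averaging individual token vectors) can be sketched in plain Python with toy vectors; trained sentence models replace this naive pooling with learned, context-aware pooling over transformer token embeddings:

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one document vector --
    the simple baseline, as opposed to a trained sentence model."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-d "token vectors" purely for illustration
doc_vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

This works as a cheap baseline, but it ignores word order and context, which is exactly what the trained sentence models above fix.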