Comment by johnwatson11218
1 day ago
I did something similar: I used pdfplumber to extract text from my pdf book collection, dumped it into postgresql, then chunked the text into 100-char chunks with a 10-char overlap. These chunks were embedded directly into a 384-D space using the python sentence_transformers library. Then I averaged all the chunk vectors for a doc and wrote that single vector back to postgresql. Finally I used UMAP + HDBSCAN for dimensionality reduction and clustering, which gives me a 2D data set I can plot with plotly to see my clusters. It is very cool to play with. It takes hours to import 100 pdf files, but I can take one folder containing a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the different topic clusters.
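A rough sketch of the flow, for anyone who wants to try it before I publish. This is not my cleaned-up code: the model name ("all-MiniLM-L6-v2" is just a common 384-D sentence-transformers model), the placeholder file names, the chunking helper, and the UMAP/HDBSCAN parameters are all illustrative assumptions; only the chunk sizes match what I actually used.

    # Sketch: extract -> chunk -> embed -> average -> reduce -> cluster.
    import numpy as np
    import pdfplumber
    import umap
    import hdbscan
    import plotly.express as px
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

    def pdf_text(path: str) -> str:
        """Pull plain text out of a PDF with pdfplumber."""
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)

    def chunk_text(text: str, size: int = 100, overlap: int = 10) -> list[str]:
        """Fixed-size character chunks with overlap (100 chars, 10-char overlap)."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def doc_vector(text: str) -> np.ndarray:
        """Embed every chunk, then average into one 384-D vector per document."""
        embeddings = model.encode(chunk_text(text))  # shape: (n_chunks, 384)
        return embeddings.mean(axis=0)

    paths = ["book1.pdf", "book2.pdf"]  # placeholder file names
    vectors = np.vstack([doc_vector(pdf_text(p)) for p in paths])

    # UMAP down to 2-D, then HDBSCAN to find topic clusters (params are guesses).
    coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(vectors)
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(coords)

    px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str)).show()

Averaging the chunk vectors throws away within-book structure, but for clustering whole books by topic a single centroid per doc turns out to be enough.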
I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to clean up the code and open source it at some point.
Yes, please publish. Sounds very interesting.
This sounds amazing, totally interested in seeing the approach and repo.
Sounds a lot like BERTopic. Great library to use.