Comment by jimmySixDOF

2 years ago

The blog post linked on GitHub was a nice walkthrough of your method. I was interested in what you think the hit rate was for getting usable text for embeddings from TFA links. 100K is a good-sized corpus, but I'm wondering how many got skipped due to paywalls, 404 links, or other problems?

Thank you for reading it.

The hit rate is low. I've only tried to get embeddings for stories with a score greater than 100. The SQL query "SELECT count(*) FROM story WHERE score > 100;" returns 155,228 stories, and the corpus contains 108,477 stories.

108,477 / 155,228 ≈ 0.6988, so roughly 70% of the stories I tried ended up in the corpus.
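
For anyone who wants to reproduce that ratio, here is a minimal Python sketch. It assumes a SQLite database with the same `story` table and `score` column as the query above; the `hit_rate` helper and its parameters are hypothetical names for illustration, and the corpus size is passed in by hand because how the corpus is stored isn't described here.

```python
import sqlite3


def hit_rate(conn, corpus_size, min_score=100):
    """Fraction of stories scoring above min_score that made it into the corpus.

    Assumes a `story` table with an integer `score` column (as in the query
    above). `corpus_size` is supplied by the caller.
    """
    (eligible,) = conn.execute(
        "SELECT count(*) FROM story WHERE score > ?", (min_score,)
    ).fetchone()
    # Avoid dividing by zero when no story clears the threshold.
    return corpus_size / eligible if eligible else 0.0


# The same arithmetic with the numbers quoted above.
print(round(108_477 / 155_228, 4))  # 0.6988
```

With a real database this would be called as `hit_rate(sqlite3.connect("path/to/db"), 108_477)`, where the path is a placeholder.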

The main problems were 404 links and posts that weren't articles (such as tweets).