Comment by woodruffw

3 years ago

Educated guess: step 5 in the process didn't attempt to discover links that are missing the http:// or https:// prefix.

That'd explain some of the holes mentioned in these comments. I think you just want to match any "word" containing ".[valid TLD]" and then exclude things that aren't URLs (an "@" in the first part indicates an email address, etc.).
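
Roughly what I mean, as a quick stdlib-only sketch. The TLDS set, the CANDIDATE pattern and find_links are all made up for illustration; a real version would load the full IANA TLD list and handle more edge cases (IDNs, IP addresses, userinfo in URLs).

```python
import re

# Assumption: a tiny hand-picked TLD set for illustration only.
# A real implementation would load the full IANA list instead.
TLDS = {"com", "org", "net", "io", "dev"}

# Optional scheme, a dotted hostname, optional port, optional path/query/fragment.
CANDIDATE = re.compile(
    r"(?:https?://)?"                               # scheme may be missing
    r"(?P<host>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"  # host with at least one dot
    r"(?::\d+)?"                                    # optional port
    r"(?:[/?#]\S*)?"                                # optional path, query or fragment
)


def find_links(text):
    """Return whitespace-delimited words that look like links, scheme or not."""
    links = []
    for raw in text.split():
        # Drop surrounding punctuation such as a trailing comma or parenthesis.
        word = raw.strip("()[]<>\"'.,;:!?")
        # An "@" before the first "/" suggests an email address, so skip it.
        if "@" in word.split("/", 1)[0]:
            continue
        match = CANDIDATE.fullmatch(word)
        if match is None:
            continue
        # Keep the word only if the last host label is a known TLD.
        tld = match.group("host").rsplit(".", 1)[1].lower()
        if tld in TLDS:
            links.append(word)
    return links
```

Something like find_links("see example.com/docs or mail bob@example.com") should keep the bare example.com/docs link and drop the email address, and "file.txt" gets rejected because "txt" isn't in the TLD set.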

I've been using this[0] Python library, which has been good enough for my needs in a scraping project.

0: https://github.com/lipoja/URLExtract