Comment by vetler

5 days ago

You will absolutely struggle to get all the info you need into 700 tokens per page.

Edit: There's also the added complexity of running a browser against 1M pages, or more.

I agree that when pages have similar structure and it's a one-time extraction (not reasoning from context), scraping with selectors is the way to go.
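To illustrate, here is a minimal sketch of selector-based extraction across similarly structured pages, using only Python's standard library (ElementTree with XPath-like paths). A real scraper would more likely use CSS selectors via BeautifulSoup or lxml; the markup and field names below are made up for the example, not taken from any particular site or library.

```python
import xml.etree.ElementTree as ET

# Two pages/items sharing the same structure (hypothetical markup).
html = """<html><body>
<div class="product"><h2 class="title">Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2 class="title">Gadget</h2><span class="price">$19.99</span></div>
</body></html>"""

root = ET.fromstring(html)

# Because the structure repeats, one set of selectors covers every item;
# no per-page reasoning (or LLM call) is needed.
products = [
    {
        "title": div.find("h2[@class='title']").text,
        "price": div.find("span[@class='price']").text,
    }
    for div in root.findall(".//div[@class='product']")
]
print(products)
```

The same selector paths can then be applied to the next million pages unchanged, which is where the cost advantage over per-page LLM extraction comes from.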

This library also supports HTML as input so running a browser is not required.

  • Came back here to say I was wrong! Since writing the comment above, I have been experimenting with a scraping pipeline that uses LLM enrichment, and it is doable; very positive results so far. :)