← Back to context

Comment by andrethegiant

7 months ago

I'm working on pure.md[1], which lets your scripts, APIs, apps, agents, etc reliably access web content in markdown format. Simply prefix any URL with `pure.md/` and you get the unblocked markdown content of that webpage. It avoids bot detection and renders JavaScript-heavy websites, and can convert HTML, PDFs, images, and more into pure markdown.

pure.md acts as a global caching layer between LLMs and web content. I like to think of it like a CDN for LLMs, similar to how Cloudinary is a CDN for images.

[1] https://pure.md

It seems to miss URLs?

At: https://willadams.gitbook.io/design-into-3d/2d-drawing the links for:

https://mathcs.clarku.edu/~djoyce/java/elements/elements.htm...

https://mathcs.clarku.edu/~djoyce/java/elements/bookI/bookI....

https://mathcs.clarku.edu/~djoyce/java/elements/bookI/defI1....

are rendered as:

_Elements_ _:_ _Book I_ _:_ _Definition 1_

Maybe detect when a page is on gitbook or some other site where there is .md source on github or some other site and grab the original instead?

  • By default, href values of <a> tags are removed, because they add significant token length without adding more context. Coming soon, you can specify a request header to set whether or not you want links removed from the response. Those underscores you mentioned are from the italics.

Thanks for sharing. I was planning on building something like this in April after hitting too many issues with Jina and Tavily but it looks like you've already done the hard work!

how do you exactly fallback to common crawl? isn't the cost to even hold and query common crawl insane?

[flagged]

  • i have no skin in the game and honestly i am wondering how this idea contributes to enshittifying the web more?

    this idea just seems like it provides the same content as visiting the site in a different view, like reader mode?

    • The service seems designed to bypass anti-scraping measures. If site owners don't want their content scraped by AI this is subverting their will.

      It also obfuscates responsibility between the AI vendor and the scraping service. One can imagine unethical AI providers using a series of ephemeral "gateways" to access content while avoiding any legal or reputational harm.

    • I think the parent is referring to the goal of making the web more "LLM friendly".