Comment by vaskal08

2 years ago

Hey, other dev on this project. This is a good catch, and we're aware of the issue. What's happening is that a photo caption is being picked up as part of the article body, and we're working on removing captions from the summarization input.

There are news APIs.

Start with those, then figure out how to scrape a site as your input and emit the same API format. You come in through a clever side route and end up with a two-phase assembly line.

This also lets users customize their "feed" as a free side effect of the architecture. You can isolate your scraping -> API transform on a per-site basis, also for free, you can parallelize the work much more cleanly, and you can even let the public add their own "transformer" for their favorite news site.
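A rough sketch of that shape in Python (all the names here are made up, just to show the idea of per-site transformers feeding one common article format):

```python
# Minimal sketch of the two-phase idea: per-site "transformers" that
# normalize scraped HTML into the same article shape a news API returns.
# Article, TRANSFORMERS, and the example domain are all hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Article:
    source: str
    title: str
    body: str
    url: str

# Phase 1: one scraper/transformer per site, registered by domain.
TRANSFORMERS: Dict[str, Callable[[str, str], Article]] = {}

def transformer(domain: str):
    def register(fn: Callable[[str, str], Article]) -> Callable[[str, str], Article]:
        TRANSFORMERS[domain] = fn
        return fn
    return register

@transformer("example-news.com")
def parse_example_news(html: str, url: str) -> Article:
    # Site-specific parsing would live here (BeautifulSoup, readability,
    # etc.); trimmed to a stub so the sketch stays self-contained.
    title, _, body = html.partition("\n")
    return Article(source="example-news.com", title=title.strip(),
                   body=body.strip(), url=url)

# Phase 2: everything downstream (summarization, feeds) consumes
# Articles and never cares which site they came from.
def normalize(domain: str, html: str, url: str) -> Article:
    return TRANSFORMERS[domain](html, url)
```

Because each transformer is an independent function keyed by domain, the per-site isolation and parallelization fall out for free, and a community contribution is just one more registered function.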

Parsing PDFs or the web semantically is really not an easy job, as I found in my own foray into LLM summarization.

Maybe image search, and if the image is not novel, ignore it? (Rough sketch of the idea after the replies below.)

  • Good point (it seems to me), and if it's AI-generated, (try to) ignore it too, I guess.

    • Why? If it's an AI-generated image, it was generated from a text prompt by the author of the article. The author has reviewed the image. The image is novel.

      As long as this is novel content, it should be parsed, I think.
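One cheap way to approximate the "is this image novel?" check without a reverse-image-search API is perceptual hashing. A minimal sketch using the imagehash library, where the seen-hash set and distance threshold are assumptions, not a real service:

```python
# Sketch of the "skip non-novel images" idea via perceptual hashing.
# SEEN_HASHES would be seeded with known stock/wire photos in practice;
# here it is just an in-memory set, and MAX_DISTANCE is a guess.
from PIL import Image
import imagehash

SEEN_HASHES: set = set()
MAX_DISTANCE = 5  # Hamming distance below which two images count as "the same"

def is_novel(path: str) -> bool:
    """Return False if the image is close to one we've seen before."""
    h = imagehash.phash(Image.open(path))
    # Subtracting two ImageHash objects yields their Hamming distance.
    if any(h - seen <= MAX_DISTANCE for seen in SEEN_HASHES):
        return False
    SEEN_HASHES.add(h)
    return True
```

Note this only catches reused or near-duplicate images; it says nothing about whether an image is AI-generated, which is the separate question debated above.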
