Comment by vaskal08

2 years ago

Hey, other dev on this project. This is a good catch, and we're aware of the issue. What's happening is that a photo caption is being picked up as part of the article body, and we're working on removing captions from the summarization input.

There are news APIs.

Start with those, then figure out how to scrape a site as your input and emit the same API format. You come in through a clever side route and end up with a two-phase assembly line.

This also lets users customize their "feed" as a free side effect of the architecture. You can isolate your scraping -> API transform on a per-site basis, also for free, you can parallelize the work much more cleanly, and you can even let the public add their own "transformer" for their favorite news site.
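A rough sketch of that shape in Python (all the names here are made up, just to show the idea of per-site transformers feeding one common article format):

```python
# Minimal sketch of the two-phase idea: per-site "transformers" that
# normalize scraped HTML into the same article shape a news API returns.
# Article, TRANSFORMERS, and the example domain are all hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Article:
    source: str
    title: str
    body: str
    url: str

# Phase 1: one scraper/transformer per site, registered by domain.
TRANSFORMERS: Dict[str, Callable[[str, str], Article]] = {}

def transformer(domain: str):
    def register(fn: Callable[[str, str], Article]) -> Callable[[str, str], Article]:
        TRANSFORMERS[domain] = fn
        return fn
    return register

@transformer("example-news.com")
def parse_example_news(html: str, url: str) -> Article:
    # Site-specific parsing would live here (BeautifulSoup, readability,
    # etc.); trimmed to a stub so the sketch stays self-contained.
    title, _, body = html.partition("\n")
    return Article(source="example-news.com", title=title.strip(),
                   body=body.strip(), url=url)

# Phase 2: everything downstream (summarization, feeds) consumes
# Articles and never cares which site they came from.
def normalize(domain: str, html: str, url: str) -> Article:
    return TRANSFORMERS[domain](html, url)
```

Because each transformer is an independent function keyed by domain, the per-site isolation and parallelization fall out for free, and a community contribution is just one more registered function.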

Parsing PDFs or the web semantically is really not an easy job, as I found in my own foray into LLM summarization.

Maybe image search, and if the image is not novel, ignore it? (Rough sketch of the idea after the replies below.)

  • Good point (it seems to me), and if it's AI-generated, (try to) ignore it too, I guess.

    • Why? If it's an AI-generated image, it was generated from a text prompt by the author of the article. The author has reviewed the image. The image is novel.

      As long as this is novel content, it should be parsed, I think.
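One cheap way to approximate the "is this image novel?" check without a reverse-image-search API is perceptual hashing. A minimal sketch using the imagehash library, where the seen-hash set and distance threshold are assumptions, not a real service:

```python
# Sketch of the "skip non-novel images" idea via perceptual hashing.
# SEEN_HASHES would be seeded with known stock/wire photos in practice;
# here it is just an in-memory set, and MAX_DISTANCE is a guess.
from PIL import Image
import imagehash

SEEN_HASHES: set = set()
MAX_DISTANCE = 5  # Hamming distance below which two images count as "the same"

def is_novel(path: str) -> bool:
    """Return False if the image is close to one we've seen before."""
    h = imagehash.phash(Image.open(path))
    # Subtracting two ImageHash objects yields their Hamming distance.
    if any(h - seen <= MAX_DISTANCE for seen in SEEN_HASHES):
        return False
    SEEN_HASHES.add(h)
    return True
```

Note this only catches reused or near-duplicate images; it says nothing about whether an image is AI-generated, which is the separate question debated above.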
