Comment by direwolf20

1 month ago

I hope they cache search results to further reduce the number of calls to Google.

And Marginalia Search was not mentioned? Marginalia Search says they are licensing their index to Kagi. Perhaps it's counted under "Our own small-web index" which is highly misleading if true.

z64 1 month ago

There is a practical limit: we can't cache results for too long, because search engine users are particularly sensitive to stale data, especially around current events. Without a holistic and reliable way to know when the cache ought to be invalidated, our caching is mostly focused on mitigating "abuse", e.g., one person or a group of people spamming the same search in a short timespan; there's no sense in repeating all those upstream calls.
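
A minimal sketch of that kind of short-lived, abuse-focused cache, assuming a single-process service. The `ShortTTLCache` name, the 30-second TTL, and the `fetch` callback are hypothetical illustrations, not Kagi's actual implementation:

```python
import time
from typing import Callable, Dict, List, Tuple


class ShortTTLCache:
    """Deduplicate identical searches issued within a short window."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],   # hypothetical upstream call
        ttl_seconds: float = 30.0,           # assumed value, kept short to avoid stale results
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def search(self, query: str) -> List[str]:
        # Normalize case and whitespace so trivially different spam hits one key.
        key = " ".join(query.lower().split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                    # repeated query: no upstream call
        results = self._fetch(key)           # first or expired query: go upstream
        self._entries[key] = (now, results)
        return results
```

The TTL is deliberately tiny: the goal is to absorb bursts of the same query, not to serve old results for current-events searches.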

Most of our "cost-saving engineering" goes into finding cases and heuristics where we only need a subset of sources, so that we can omit calls in the first place without compromising quality. For example, we probably don't need to fire all of our sources to service a query like "youtube" or "facebook".
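
One way to picture such a heuristic is the sketch below. The source names and the navigational-term list are made up for illustration (only "youtube" and "facebook" come from the example above); a real system would presumably use far richer signals:

```python
from typing import List

# Hypothetical source identifiers, ordered from primary to supplementary.
ALL_SOURCES: List[str] = ["upstream_a", "upstream_b", "small_web", "news"]

# Queries that are almost certainly navigational (assumed list).
NAVIGATIONAL_TERMS = {"youtube", "facebook"}


def select_sources(query: str) -> List[str]:
    """Return the subset of sources worth calling for this query."""
    normalized = query.strip().lower()
    if normalized in NAVIGATIONAL_TERMS:
        # The user wants one well-known site; a single source is enough.
        return ALL_SOURCES[:1]
    # Otherwise fall back to querying everything.
    return list(ALL_SOURCES)
```

For instance, `select_sources(" YouTube ")` returns only the first source, while a longer informational query fans out to all of them.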

Marginalia data is physically consolidated into the same infra that we use for small web results in our SERP, alongside other small-scale sources besides those two. That line is simply referring directly to https://kagi.com/smallweb (https://github.com/kagisearch/smallweb).

AlienRobot 1 month ago
To me, a lot of problems with "building a search engine" don't seem to be problems with "building a search engine," they seem to be problems with "building a Google."
Nobody said a search engine needs to have fresh data, for example. Nor has anybody said a search engine needs to index the entire web. Yet these are two things every search engine tries to do, and then they usually fail to compare with Google.
To put it another way, the reason TikTok succeeded against YouTube is exactly that TikTok wasn't trying to be a YouTube.
- Nextgrid 1 month ago
  
  I don't think TikTok "succeeded" compared to YouTube? TikTok succeeded in popularizing short-form video, but I'd argue that's a different product. YouTube is still king for long-form video.
  While there might be arguments for building a different product (and LLM-based search like Perplexity is trying it), there appears to be enough demand for a "good Google" that Kagi is trying to address.
  
- terribleperson 1 month ago
  
  I'll say that a search engine needs to have fresh data. When I search for a phrase from a Reddit thread I saw earlier, I want that exact thread to be in the results.
  When I search for a brand new restaurant, I want to see a map entry for that restaurant and a link to a newspaper article, ad, or Facebook post announcing the opening of that restaurant (though I probably won't click on the third).

packetlost 1 month ago

The index is not necessarily the code, but the dataset. IMO it would be better to be more open about the technical stack, but this doesn't feel dishonest to me.

xnx 1 month ago

> "Our own small-web index"

Has Kagi ever said what this is? I wouldn't be at all surprised if it is just kagi.com pages or a download of Wikipedia.

z64 1 month ago
https://github.com/kagisearch/smallweb
- jrmg 1 month ago
  
  From that:
  ——
  Criteria for posts to show on the website
  If the blog is included in small web feed list (which means it has content in English, it is informational/educational by nature and it is not trying to sell anything) we check for these two things to show it on the site:
  - Blog has recent posts (<7 days old)
  - The website can appear in an iframe
  ——
  Emphasis mine. Restricting visibility to blogs that post at least every week doesn’t feel very ‘small web’ to me.
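
The two quoted criteria could be checked roughly as follows. This is a sketch under stated assumptions, not the repository's actual code: it assumes "can appear in an iframe" is decided from the `X-Frame-Options` and `Content-Security-Policy` response headers, and all function names are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping


def has_recent_post(latest_post: datetime, now: datetime,
                    max_age: timedelta = timedelta(days=7)) -> bool:
    """True if the newest post is less than 7 days old."""
    return now - latest_post < max_age


def can_be_framed(headers: Mapping[str, str]) -> bool:
    """Conservative guess at whether a page allows third-party framing."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    if lowered.get("x-frame-options", "").strip() in {"deny", "sameorigin"}:
        return False
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.split()
        if parts and parts[0] == "frame-ancestors":
            # Simplification: anything other than a wildcard counts as blocked.
            return "*" in parts[1:]
    return True


def should_show(latest_post: datetime, headers: Mapping[str, str],
                now: datetime) -> bool:
    """Both criteria must hold for a post to be shown."""
    return has_recent_post(latest_post, now) and can_be_framed(headers)
```

Under this reading, a blog whose newest post is eight days old drops out even if it is otherwise eligible, which is the behavior being questioned here.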
  
marginalia_nu 1 month ago

I believe it was formerly run under the name Teclis[1]. Reportedly they took it down for a while, but now it's apparently back up. The page has quite an extensive writeup on how it operates.
[1] https://teclis.com/