- cross-posted to:
- technology@lemmy.world
Kagi has quickly grown into something of a household name within tech circles. From Hacker News and Lobsters to Reddit, the search provider seems to attract near-universal praise. Whenever the topic of search engines comes up, there’s an almost ritual rush to be the first to recommend Kagi, often followed by a chorus of replies echoing the endorsement.
Took me a while to get back to this, but yeah, I agree that it seems at least conceptually solid. The big barrier is that, like jarfil mentioned, you’d need at least 200 million sites indexed, so you’d need a good number of users for it to work. And the users would need to consent to running some software that basically logs all the pages they visit. There would be a privacy concern: if you can see which “node” an indexed result was pulled from, you can tell that the user behind that node has visited that site. This could maybe be fixed by each user also downloading indexed site data from others, beyond what they personally visit, so their own activity is mixed in indistinguishably with everyone else’s? Probably clever vulnerabilities in that too, though.
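To make the "mixing" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `MixedIndexStore` class, its method names, the payload format); the only point is that the store never records whether an entry came from the user's own browsing or from another peer, so a node that serves an entry reveals nothing by itself about who visited the page.

```python
import random


class MixedIndexStore:
    """Hypothetical peer-side store that keeps no record of an entry's origin."""

    def __init__(self):
        # url -> index payload. Deliberately no "origin" or "visited" field.
        self._entries = {}

    def add_visited(self, url, payload):
        # Entry produced from a page this user actually visited.
        self._entries[url] = payload

    def add_replicated(self, url, payload):
        # Entry copied from another peer. Stored exactly like a visited one.
        self._entries[url] = payload

    def served_urls(self, rng=None):
        # Shuffle so insertion order does not leak which entries came first.
        urls = list(self._entries)
        (rng or random).shuffle(urls)
        return urls

    def lookup(self, url):
        return self._entries.get(url)
```

This only covers the data at rest. Timing, request patterns, and which entries show up first on a fresh node could still leak information, so treat it as an illustration of the idea rather than a privacy guarantee.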
Structurally it seems a lot like DNS. If only DNS servers were fine with storing embeddings of site content and making those queryable, it would seemingly accomplish the same idea, aside from it being in the hands of DNS operators. Of course, that multiplies the amount of data these servers need to store to an impossible degree.
I still need to read up on what primitive indexing really looks like and how much space it takes to store per site.
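As a starting point for that reading, here is roughly what the most primitive form of indexing looks like: a toy inverted index in Python that maps each term to the pages containing it, plus a crude size measurement. The function names and the JSON-based size estimate are assumptions made for illustration; real engines add ranking, stemming, positional data, and compression, so actual per-page sizes will differ a lot.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: dict of url -> page text. Returns (urls, term -> sorted doc ids)."""
    urls = sorted(pages)
    postings = defaultdict(set)
    for doc_id, url in enumerate(urls):
        for term in tokenize(pages[url]):
            postings[term].add(doc_id)
    return urls, {term: sorted(ids) for term, ids in postings.items()}


def search(urls, index, query):
    """Return URLs of pages containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return []
    ids = set(index.get(terms[0], []))
    for term in terms[1:]:
        ids &= set(index.get(term, []))
    return [urls[i] for i in sorted(ids)]


def index_size_bytes(urls, index):
    # Crude estimate: size of the uncompressed JSON serialization.
    return len(json.dumps({"urls": urls, "index": index}).encode("utf-8"))
```

Running `index_size_bytes` over a few saved pages and dividing by the page count would give a first rough per-page figure to plug into the storage question.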
Oh, yeah, that would be bad. Maybe something like an onion network would help, but I suspect it’d be subject to timing attacks, and it’d eliminate all potential “friend peer” configuration benefits. I suppose another mitigation would be, as you said, some caching from peers. I was thinking limited caching, but if you even doubled the cache size, or tripled it, so that only 1/3 of the index “belonged” to the peer and the rest came from other nodes, you’d have a sort of Freenet situation where you couldn’t prove anything about the peer itself. How big would indexes get, anyway? My buku cache is around 3.2MB. I can easily afford to allocate 50MB for replicating data from other peers’ DBs. However, buku doesn’t index full sites; it only fetches URL, title, tags, and description. We’d want something which at least fully indexes the URL’s page, and real search engines crawl entire sites.
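For a feel of the numbers, a back-of-envelope calculation in Python might look like the following. The 200 million pages, the 50MB per-peer budget, and the "only 1/3 is your own" replication factor come from this thread; the per-entry size is a pure guess, since nobody here has measured it, and the function name is made up. It also assumes perfectly even distribution with no wasted overlap, which a real network would not achieve.

```python
def peers_needed(total_pages, bytes_per_entry, budget_bytes, replication=3):
    """Minimum peer count so every page is stored `replication` times.

    Assumes an idealized, perfectly even spread of entries across peers.
    """
    entries_per_peer = budget_bytes // bytes_per_entry
    if entries_per_peer == 0:
        raise ValueError("budget is too small to hold a single entry")
    total_slots = total_pages * replication
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_slots // entries_per_peer)


# ASSUMPTION: 2 KB per indexed page is a placeholder, not a measured value.
estimate = peers_needed(
    total_pages=200_000_000,
    bytes_per_entry=2_000,
    budget_bytes=50_000_000,
    replication=3,
)
print(estimate)  # 24000 under these assumptions
```

With the guessed 2 KB per entry, that comes to about 24,000 peers. If full-page indexing costs ten times as much per entry, the peer count scales up by the same factor, so the per-entry size is really the number that decides feasibility.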
Maybe it’d be infeasible.