Can we RAG the whole web? (philippeoger.com)
21 points by jeanloolz 11 days ago | 21 comments





This is exactly what https://www.perplexity.ai/ is trying to do. Maybe not "RAGing" the entire internet, but it does map natural-language queries onto its own (probably) vector database, which holds a "source of truth" drawn from the internet.

How they build that database, and which models they use for text tokenization, embedding generation, and ranking at "internet" scale, is the secret sauce that has enabled them to raise more than $165M to date.
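The query-to-vector-database mapping can be sketched in a few lines. This is purely illustrative, not Perplexity's actual stack: toy bag-of-words vectors stand in for real learned embeddings, and the corpus is three hard-coded snippets rather than a crawled index.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector -- a stand-in for a real embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Source of truth" snippets, as if crawled from the web.
corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain.",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved snippets become the context the LLM answers from.
context = retrieve("What is the capital of France?")
```

A real system swaps `embed` for a trained embedding model and `index` for an approximate-nearest-neighbour store, but the retrieve-then-generate shape is the same.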

For sure this is where internet search will be in a couple of years, and that's why Google got really concerned when the original ChatGPT was released. That said, don't assume Google isn't already working on something similar. In fact, the main theme of their Google Next conference was LLMs and RAG.


Is connecting a search engine to an LLM not technically RAG for the whole web?

"We" in the article refers to the collective of individual users, not search engine companies.

The spirit of the article is about how this can be achieved in a decentralized way, without search engines: just your LLM plus the embedding databases the article proposes each website would publish.

A problem with this is that you still need to keep local copies of the databases you collect by crawling the web, and to teach your LLM to use them.


I don't think so. AFAIK when you do that, it searches normally based on the terms you give, and then does RAG on the results. Semantically this seems like "doing RAG on the web", but RAG is a specific operation, and in that case it is only applied to a subset of the web (the indexed results).
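That search-then-RAG shape can be sketched like this. Everything here is hypothetical: `web_search` is a stand-in for a real search API, and the resulting prompt is what would be handed to an LLM.

```python
def web_search(query, k=3):
    # Stand-in for a real search API (Mojeek, Brave, Bing, ...).
    fake_index = {
        "rag": [
            "RAG combines retrieval with generation.",
            "Retrieval grounds answers in source text.",
        ],
    }
    hits = [doc for term, docs in fake_index.items()
            if term in query.lower() for doc in docs]
    return hits[:k]

def build_prompt(query):
    # Retrieval happens only over the search results -- a subset of
    # the web, never the web itself.
    snippets = web_search(query)
    context = "\n".join(snippets)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What is RAG?")  # this string would go to the LLM
```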

Cool idea. This is a decentralized RAG approach and useful for individual sites, e.g. those built on WordPress. How do you find the sites you want to "RAG" on, though? Individual domains can be vast — Google itself, for example.

Well, there's nothing new under the sun. Whatever cooperation model you may have come up with, it has been invented again, and again, and again.

Before you invent a new protocol, look at the Semantic Web (RDF et al.), Microformats, and so on.


I think we need a search engine that has an API. Doesn't Kagi have one?

Not the whole web; LinkedIn and a few others block us and we fully respect robots.txt, but we have ~8 billion pages.

edit: from article, "Doing this for a few urls is easy but doing it for billions of urls starts to get tricky and expensive (although not completely out of reach)" - indeed so, but we have now done embeddings for about half of those ~8 billion pages and are using them for mojeek.com.

We have an API with many features including uniquely authority and ranking scorings. Embeddings could be added.

https://www.mojeek.com/services/search/web-search-api/ is used by Kagi, Meta and others. Self-disclosure: Mojeek team member.


Author of the article here. I just went through your website and can't believe I had never heard of Mojeek. I'll probably have a go at your API eventually.

Brave Search also has one https://brave.com/search/api/

Better pricing than Bing, as well. And their summary feature is pretty good.

Perplexity has models that include RAG access available via API - their "online" models. https://docs.perplexity.ai/docs/model-cards

Kagi's API is very limited relative to the features they offer through their web interface: you can't search using the full results they serve. They only offer their own index as an API, which I think is relatively small.

FIYDRI^: The core idea discussed in this post is less about RAG and more about sharing web content in packages that are easier for crawlers to access - including an experiment that uses downloadable SQLite databases for that.

^ For If You Didn't Read It
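The SQLite experiment mentioned above could look roughly like this. The schema, column names, and two-dimensional vectors are my assumptions for illustration, not the article's actual format; the in-memory database stands in for a `site.db` file you'd download from a website.

```python
import json
import math
import sqlite3

# Stands in for a downloaded per-site embeddings database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (url TEXT, embedding TEXT)")
con.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("https://example.com/a", json.dumps([1.0, 0.0])),
        ("https://example.com/b", json.dumps([0.0, 1.0])),
    ],
)

def nearest(query_vec, k=1):
    # Brute-force cosine similarity over every stored page vector.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    rows = [(url, cos(query_vec, json.loads(e)))
            for url, e in con.execute("SELECT url, embedding FROM pages")]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]

best = nearest([0.9, 0.1])  # closest page for a given query embedding
```

A crawler would only need to fetch one such file per site, instead of re-crawling and re-embedding every page itself — which is the access-friendliness the post is after.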


this is exa's mission: https://exa.ai

I've been using Kagi's "Quick answer" more and more these days, which I guess is a form of "index the whole web" RAG.

Here's their blog article for it: https://help.kagi.com/kagi/ai/quick-answer.html You have to fire up your bullshit detector when looking at the results, but I find it saves a good 3–4 clicks on average.


"RAG, or Retrieval-Augmented Generation, is a method where a language model such as ChatGPT first searches for useful information in a large database and then uses this information to improve its responses."

Aren't LLMs already trained on the whole web? No need for RAG, in theory.

Training doesn't work like that. Just because a model has been exposed to text in its training data doesn't mean the model will "remember" the details of that text.

Llama 3 was trained on 15 trillion tokens, but I can download a version of that model that's just 4GB in size.
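A quick back-of-the-envelope calculation makes the point, using the figures above:

```python
# Llama 3: ~15 trillion training tokens; a quantized download can be ~4 GB.
tokens = 15e12
model_bytes = 4e9
bytes_per_token = model_bytes / tokens  # roughly 0.00027 bytes per token
# Far under one byte of weights per training token: the model compresses
# patterns, it cannot possibly store the training text verbatim.
```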

No matter how "big" your model is, there is still scope for techniques like RAG if you want it to return answers grounded in actual text, as opposed to often-correct hallucinations spun up from the giant matrices of numbers in the model weights.


They're only trained up to a certain point in time, so adding RAG should hypothetically allow such LLMs to access the most up-to-date information.

GPT-2 was launched in 2019, followed by GPT-3 in 2020, and GPT-4 in 2023. RAG is necessary to bridge informational gaps in between long LLM release cycles.


