This is exactly what https://www.perplexity.ai/ is trying to do. Maybe not "RAGing" the entire internet, but they are surely mapping natural-language queries onto their own (probably) vector database, which contains "sources of truth" from the internet.
How they build that database, and which models they use for text tokenization, embedding generation, and ranking at "internet" scale, is the secret sauce that has enabled them to raise more than $165M to date.
For sure this is where internet search will be in a couple of years, and that's why Google got really concerned when the original ChatGPT was released. That said, don't assume Google isn't already working on something similar. In fact, the main theme of their Google Next conference was LLMs and RAG.
"We" in the article refers to the collective of individual users, not search engine companies.
The spirit of the article is about how this could be achieved in a decentralized way, without search engines: just your LLM plus the embedding databases the article proposes each website would publish.
A problem with this is that you still need to keep local copies of the databases you collect while crawling the web, and train your LLM to use them.
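To make the idea concrete, here's a minimal sketch of the consumer side, assuming (hypothetically) a site publishes a SQLite file with a `pages(url, text, embedding)` table and embeddings stored as JSON arrays. The schema, table name, and toy 3-d vectors are all made up for illustration; a real setup would use an actual embedding model and an agreed-upon format.

```python
import json
import math
import sqlite3

def cosine(a, b):
    # Plain cosine similarity; a real system would use numpy or a vector index.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(db, query_vec, k=3):
    # Brute-force scan of the site's published embeddings.
    rows = db.execute("SELECT url, text, embedding FROM pages").fetchall()
    scored = [(cosine(query_vec, json.loads(e)), url, text)
              for url, text, e in rows]
    return sorted(scored, reverse=True)[:k]

# Demo: an in-memory stand-in for a downloaded per-site database,
# with toy 3-d embeddings in place of real model output.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, text TEXT, embedding TEXT)")
db.executemany("INSERT INTO pages VALUES (?, ?, ?)", [
    ("https://example.com/a", "Intro to RAG", json.dumps([0.9, 0.1, 0.0])),
    ("https://example.com/b", "Cooking pasta", json.dumps([0.0, 0.2, 0.9])),
])
results = top_k(db, [1.0, 0.0, 0.0], k=1)
print(results[0][1])
```

The point being: once the embeddings exist as a downloadable file, the retrieval step is trivial; the hard parts are crawling, storage, and keeping the copies fresh.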
I don't think so. AFAIK, when you do that, it searches normally based on the terms you give, and then does RAG on the results. Semantically this seems like "doing RAG on the web", but RAG is a specific operation, and in that case it's only being applied to a subset of the web (the indexed results).
Cool idea. This is a decentralized RAG approach and useful for individual sites, e.g. those built on WordPress. How do you find the site you want to "RAG" on, though? Individual domains can be vast; Google itself, for example.
Not the whole web; LinkedIn and a few others block us and we fully respect robots.txt, but we have ~8 billion pages.
edit: from article, "Doing this for a few urls is easy but doing it for billions of urls starts to get tricky and expensive (although not completely out of reach)" - indeed so, but we have now done embeddings for about half of those ~8 billion pages and are using them for mojeek.com.
We have an API with many features, including, uniquely, authority and ranking scores. Embeddings could be added.
Author of the article here. Just went through your website and I can't believe I've never heard of Mojeek. I'll probably have a go at your API eventually.
Kagi's API is very limited relative to the features they offer through their web interface. You can't search the full results they serve; they only offer their index as an API, which I think is relatively small.
FIYDRI^: The core idea discussed in this post is less about RAG and more about sharing web content in packages that are easier for crawlers to access - including an experiment that uses downloadable SQLite databases for that.
I've been using Kagi's "Quick answer" more and more these days, which I guess is a form of "index the whole web" RAG.
Here's their blog article for it: https://help.kagi.com/kagi/ai/quick-answer.html
You have to fire up your bullshit detector when looking at the results, but I find it saves a good 3-4 clicks on average.
"RAG, or Retrieval-Augmented Generation, is a method where a language model such as ChatGPT first searches for useful information in a large database and then uses this information to improve its responses."
Training doesn't work like that. Just because a model has been exposed to some text in its training data doesn't mean the model will "remember" the details of that text.
Llama 3 was trained on 15 trillion tokens, but I can download a version of that model that's just 4GB in size.
No matter how "big" your model is there is still scope for techniques like RAG if you want it to be able to return answers grounded in actual text, as opposed to often-correct hallucinations spun up from the giant matrices of numbers in the model weights.
GPT-2 was launched in 2019, followed by GPT-3 in 2020 and GPT-4 in 2023. RAG is necessary to bridge the informational gaps between long LLM release cycles.