Comment by atleastoptimal

2 years ago

Google has the biggest proprietary moat of information of any company in the world I'm sure.

No, Google is on a more level playing field than you think. It certainly can't train on enterprise data, and of course not on private user data like emails. Cross-division data sharing is tough as well, because regulators don't like it for anti-monopoly reasons. OpenAI can scrape YouTube all it wants, but DeepMind may not be able to just train against all of YouTube just like that.

We might soon get to a point where every player is using pretty much all the low-cost data there is. Everyone will use all the public internet data there is, augmented by as many private datasets as they can afford.

The improvements we can expect to see in the next few years look like a Drake equation.

LLM performance delta = data quality x data quantity x transformer architecture tweaks x compute cost x talent x time.

The ceiling for the cost parameters in this equation is determined by the expected market opportunity at the margin: how much more of the market you can capture if you have the better tech.
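To make the multiplicative shape of that equation concrete, here is a minimal sketch. The factor names mirror the equation above ("compute cost" is shortened to `compute`); the numeric values and the helper names `performance_delta` and `with_factor` are hypothetical, chosen only for illustration and not measured from anything.

```python
from math import prod

# Factor names follow the comment's equation; the values are hypothetical
# relative multipliers (1.0 = parity with a competitor), not measurements.
BASELINE = {
    "data_quality": 1.0,
    "data_quantity": 1.0,
    "architecture_tweaks": 1.0,
    "compute": 1.0,
    "talent": 1.0,
    "time": 1.0,
}


def performance_delta(factors: dict[str, float]) -> float:
    """Multiply all factors together, as in the Drake-style equation."""
    return prod(factors.values())


def with_factor(factors: dict[str, float], name: str, value: float) -> dict[str, float]:
    """Return a copy of `factors` with one factor changed."""
    if name not in factors:
        raise KeyError(f"unknown factor: {name}")
    return {**factors, name: value}


if __name__ == "__main__":
    # Everyone at parity.
    print(performance_delta(BASELINE))
    # One player doubles a single factor.
    print(performance_delta(with_factor(BASELINE, "compute", 2.0)))
    # One factor collapses to zero.
    print(performance_delta(with_factor(BASELINE, "data_quality", 0.0)))
```

Because the terms multiply, doubling any single factor doubles the whole product, and a zero in any one factor wipes out the rest. That is the sense in which the equation behaves like a Drake equation rather than a sum of independent advantages.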

  • > DeepMind may not be able to just train against all of YouTube just like that

    What? Why?

    > data quality x data quantity x transformer architecture tweaks x compute cost x talent x time.

    Google arguably has the most data (its search index), the best data (already ranked and curated, along with datasets like books), the cheapest compute (they literally run their own cloud offering and are one of the biggest purchasers of H100s), and the oldest and most mature ML team.

Maybe it is too much? If you just train LLMs on the entire Internet, it will be mostly garbage.

  • I have heard claims that lots of popular LLMs, possibly including GPT-4, are trained on things like Reddit, so maybe it's not quite garbage in, garbage out if you include lots of other data. Google also has untold troves of data that are not widely available on the Web, including all the books from its decades-long book indexing project.

Yes, you can say that again and again.

Google has the best Internet search engine, bar none, and personally I wouldn't normally use Bing except through ChatGPT.

It has Google Books, and I believe it has been scanning books for more than a decade now. It's good to know that the next time a Mongol-like invasion happens (as happened to the old city of Baghdad), all the books' contents will be well backed up /s

It has Google Patents, and the original idea of patenting is knowledge dissemination in return for royalties; that knowledge would otherwise be locked behind closed industry doors.

It has Google Scholar; some of the papers are behind paywalls, but most of the content is already cached somewhere (e.g. preprint servers, Sci-Hub, online thesis portals).

It has Google Video, aka YouTube, where watching all the videos uploaded to the platform within a single hour would probably take longer than your lifetime (assuming you watched videos non-stop from cradle to grave, doing nothing else and never sleeping).

Ultimately, it has Google Mail, or Gmail. To say that Google does not access the emails on a platform it provides for free is naive, and almost all my colleagues, friends, and acquaintances (people I know personally) have Gmail.

A UK ex-PM (no prize for correctly guessing who) once said on national TV that "Google probably knows more about him than he knows about himself" (TM).

Google once claimed that no one has a moat in LLMs, but on the planet I live on, no one has organized the world's information like Google. Ironically, the CEO just reminded us in the Gemini introduction video that Google's corporate mission statement is to organize the world's information, and AI, LLMs, RAG (insert your favourite acronym soup here) are the natural extensions of what they have been doing all along.