
Comment by cs702

3 years ago

Left unsaid in this piece is that OpenAI likely would have to increase parameters and compute by an order of magnitude (~10x) to train a new model that offers noticeable improvements over GPT-4, due to the diminishing returns seen in "transformer scaling laws."
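
For a sense of what "diminishing returns" means quantitatively, here is a back-of-envelope sketch in Python using the power-law form from Kaplan et al. (2020), L(N) ≈ (N_c/N)^α_N. The constants are theirs; the parameter counts are made up for illustration and say nothing about GPT-4's actual size:

    # Rough illustration of diminishing returns under a power-law scaling law,
    # L(N) ~ (N_c / N)^alpha_N, with the constants published in Kaplan et al. (2020).
    # Purely illustrative -- not GPT-4's actual numbers.
    N_C = 8.8e13      # non-embedding parameter constant from Kaplan et al.
    ALPHA_N = 0.076   # parameter-count exponent from Kaplan et al.

    def loss(n_params: float) -> float:
        return (N_C / n_params) ** ALPHA_N

    for n in (1e11, 1e12, 1e13):  # 100B -> 1T -> 10T parameters
        drop = 1 - loss(10 * n) / loss(n)
        print(f"{n:.0e} -> {10 * n:.0e} params: loss falls by {drop:.1%}")
    # Every 10x in parameters trims only ~16% off the loss, so each
    # "noticeable" jump keeps demanding another order of magnitude.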

Also, it's possible that OpenAI is still training GPT-4, perhaps with additional modalities, and will make future snapshots available as public releases.

> Left unsaid in this piece is that OpenAI likely would have to increase parameters

Maybe true, but he also said "We are not here to jerk ourselves off about parameter count"

https://techcrunch.com/2023/04/14/sam-altman-size-of-llms-wo...

Also, who says that the "transformer scaling laws" are the ultimate arbiter of LLM scaling? They overturned previous scaling laws, and other scaling laws might overturn them. It's even possible that the transformer won't be used in later models at all. I remember Ilya making the point that just because the transformer was the first architecture that looked like it could scale intelligence just by lighting up billions of dollars of GPUs, that doesn't mean it's the last one. Maybe it will even turn out to be the vacuum tube of AI models, and its successors are being built in secret. A Hacker News rumor was that they are paying $5M-$20M per year to top neural-net experts, presumably to develop exotic architectures that surpass the transformer.

  • > A Hacker News rumor was that they are paying $5M-$20M per year to top neural-net experts, presumably to develop exotic architectures that surpass the transformer

    This reminds me of a TV interview with the author Patrick Modiano, given just after he won the Nobel Prize in Literature. The presenter asked him if the money would help. The author answered, essentially, that the next time he was in front of a blank page, the money surely wouldn't help.

    In the case of surpassing transformers, money could help by giving access to more compute power. It could also help keep the research from becoming public.

    • Modiano is a rich man, born into a rich family. Wealth doesn't help in front of a blank page, but it sure helps you stay in front of that blank page instead of having to take up a job because you're not sure what you're eating tonight.

      As always, wealthy people and their "money doesn't make happiness" bullshit.


    • If someone is already working on a problem full-time, money only helps to the extent that resources they can buy with money are the limiting constraint. However, beyond the deep work a single individual can do, when you need to explore potential opportunities in a broad space of possibilities, money can hugely affect the search of that space, because the work needed for major breakthroughs remains parallelizable. You can delegate subtasks to people if you can afford those people. You can hire more of the few specialized people who know about a niche to work on your problem instead of other problems. You can exploit synergies from cross-pollination of ideas by bringing brilliant minds into the same conversations. The influx of money is very likely to increase the pace of innovation in AI. The breadth of possible avenues for breakthroughs is largely yet to be explored.

  • I'm not an expert but isn't size the distinguishing feature of an LLM? It's the first L.

    • They needed an architecture that could take advantage of the scale, first. That's what BERT did.

  • Curious if anyone can confirm the $5M-$20M figure. Seems absurdly high, but what do I know.

    • Can't confirm OpenAI's position in particular, but $500k/yr/person is table stakes for a decent engineer directly connected to the company's bottom line. Double that for an actual expert, double it again if they're consulting, and put together a team of 3-10 of them. Those numbers aren't too far off.


    • I wouldn't put any stock into a random twitter rumor by someone likely looking for clout. The source, some guy with likely a purchased checkmark and 12k followers (who knows how few before he claimed to have this insider knowledge), claims four(!) different "extremely reputable" sources that have independently confirmed it. How many people exactly are they making these offers to? Do they all happen to know this guy, someone with no discretion apparently, and everyone decided to tell him this information for what reason exactly?

      99% chance it's made up.

      That said, if they thought a specific individual had even a reasonable chance of coming up with an improvement on the current state-of-the-art AI architecture that they'd be able to keep entirely to themselves, $20M would be a massive bargain.

      The rumor is still almost certainly fake, but for someone very specific at this critical time in the field, I don't know if the number would be that absurd.


Actually, what he has said is that the biggest performance gains came from reinforcement learning from human feedback (RLHF).

There are also all of the quantization and other tricks out there.

Also, they have demonstrated that the model already understands images; they just haven't completed the API for it.

So: use quantization to increase speed by a factor of 3 while slightly increasing the parameter count; maybe find a way to make the network sparser and more efficient so that, combined with the quantization, the model actually uses significantly less memory; and continue with the RLHF, focusing on even more difficult tasks and ones that incorporate visual data.

Then instead of calling it GPT-5 they just call it GPT-4.5. Twice as fast as GPT-4, IQ goes from 130 to 155. And the API now allows images to be passed in and analyzed.
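
For intuition on why quantization buys that kind of memory and speed headroom, here is a minimal sketch of per-tensor symmetric int8 weight quantization in NumPy. It is just the generic technique, not anything OpenAI has described, and the matrix size is arbitrary:

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Per-tensor symmetric int8 quantization: w ~= scale * q."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)  # one toy weight matrix
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
          f"mean abs error {err:.4f}")
    # ~4x smaller weights (fp32 -> int8) with a small reconstruction error;
    # when inference is memory-bound, this also lifts throughput.

In practice you'd use something more careful (per-channel scales, calibration-based methods like GPTQ, and so on), but the 4x shrink from fp32 to int8 is where the memory and bandwidth savings come from.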

  • There is an API for multimodal computer vision and visual reasoning/VQA, and it's available, just not for normies. It's exclusively for their test group and then the Be My Eyes project at https://www.bemyeyes.com/.

    • I was wondering when someone would point this out. The API is called “rainbow”, and it handles not only recognition/reasoning but also generation.

      It’s a very limited model for a select few.


    • I assume they will release this API publicly at some point?

      It's amazing the extreme levels of advantage that groups have depending on funding and connections.


I bet they’re not saying how big a model GPT-4 is because it’s actually much smaller than we would expect.

ChatGPT is, IMO, a heavily fine-tuned Curie-sized model (same price via the API + less cognitive capacity than even text-davinci-003), so it would make sense that a heavily fine-tuned Davinci-sized model would yield similar results to GPT-4.

  • I wouldn't bet on their pricing being indicative of their costs. If MSFT wants the ChatGPT-API to be a success and is willing to subsidize it, that's just how it is.

    • It’s not only 10x cheaper, it’s also way faster at inference and not as smart as Davinci. IMO the only logical answer is that the model is just smaller.

  • I wonder why it's slower at inference time then (for members using the web UI), or rather, if it's similar in size to GPT-3, how GPT-3 is optimized in a way that GPT-4 isn't or can't be?

    I'd expect that by now we would enjoy similar speeds but this hasn't yet happened.

We are also starting to run out of high-quality corpora to train on at such model scales. While video offers another large set of data, we'll have to look at further RL approaches in the next few years to continue scaling datasets.

  • Is there any source for this, aside from it being oft repeated by internet speculators? Ilya has said the textual data situation is still quite good

    • If they're running into any limits in that respect, my bet would be that the limit is only on what is easily accessible to them without negotiating access, and that they can easily go another order of magnitude or two with more incremental effort to strike deals: newspaper archives, national libraries and the like. (I haven't looked at other languages, but GPT-3's Norwegian corpus - I don't know of any numbers for GPT-4 - could easily be scaled by at least two orders of magnitude with access to the Norwegian national library collection alone.)

    • Depends on the quality. A ten-trillion-parameter model should require roughly 10 trillion tokens to train. Put another way, that's roughly 10,000 Wikipedias or 67 million books, or roughly 3-4 GitHubs.

      It’s been established that LLMs are sensitive to corpus selection, which is part of why we see anecdotal variance in quality across different LLM releases.

      While we could increase the corpus of text by loading in social media comments, self-published books, and other similar text, this may negatively impact the final model's quality/utility.
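
      As a back-of-envelope check on those figures, here is the arithmetic in Python; the per-source token counts below are rough assumptions on my part, not measured values:

        # Token budget for a hypothetical 10T-parameter model, assuming the
        # rough rule of thumb of ~1 training token per parameter.
        tokens_needed = 10e12

        wiki_tokens = 1e9       # assumed ~1B tokens for a Wikipedia-sized corpus
        book_tokens = 150e3     # assumed ~150k tokens per book
        github_tokens = 3e12    # assumed ~3T usable tokens per "GitHub"

        print(f"~{tokens_needed / wiki_tokens:,.0f} Wikipedias")
        print(f"~{tokens_needed / book_tokens / 1e6:.0f} million books")
        print(f"~{tokens_needed / github_tokens:.1f} GitHubs")
        # -> ~10,000 Wikipedias, ~67 million books, ~3.3 GitHubs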

    • Yeah, I need a source on this. GPT-3's corpus is, what, a few hundred TB? Absolutely nowhere near the total amount of tokens we could collect, e.g. from YouTube/podcasts.

I often see mistakes when ChatGPT is faced with more spatial reasoning, and I wonder if changes as simple as deep convolutional subnetworks in intermediate layers would help the language model fit better in these situations. In short, I’m excited to see where things go, and can definitely see room for great gains through improvements to the architecture!
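
To make that concrete, here is a hypothetical PyTorch sketch of a depthwise 1-D convolutional block that could be interleaved between transformer layers; ConvAdapter is an invented name, and this is only an illustration of the idea, not anything any lab has announced:

    import torch
    import torch.nn as nn

    class ConvAdapter(nn.Module):
        """Hypothetical depthwise 1-D conv block to slot between transformer
        layers, mixing information from nearby tokens ("local" structure)."""
        def __init__(self, d_model: int, kernel_size: int = 5):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            # groups=d_model makes this a cheap depthwise convolution.
            # Note: symmetric padding peeks at future tokens; a causal LM
            # would need left-only padding instead.
            self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                                  padding=kernel_size // 2, groups=d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
            y = self.conv(self.norm(x).transpose(1, 2)).transpose(1, 2)
            return x + y  # residual, so the block can start as a near no-op

    x = torch.randn(2, 16, 512)       # 2 sequences, 16 tokens, d_model = 512
    print(ConvAdapter(512)(x).shape)  # torch.Size([2, 16, 512])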

How noticeable the changes will be has little connection with loss reduction during training. Holding very complex thought processes may not actually reduce the loss all that much, but they are very noticeable when we interact with these systems.

They could clean up the training data I bet. That would be where I'd focus next.

  • Is there any indication from OpenAI people that there is low-hanging fruit to be picked in this direction?

    • It’s what current research data indicates: train more and on better data; current models are oversized and undertrained. A good foundation model can exhibit massive quality differences with just a tiny bit of quality fine-tuning (e.g. Alpaca vs Koala).

      Personal opinion, not OAI/GH/MSFT’s

> Also, it's possible that OpenAI is still training GPT-4, perhaps with additional modalities, and will make future snapshots available as public releases.

Read OpenAI API docs on GPT model versions carefully, and look at them again from time to time.

https://platform.openai.com/docs/models

In my machine learning experience, if it only takes 10x the parameters to bring a significant improvement, I feel lucky.

Vicuna offers a considerable improvement over LLaMA, and it's just a 13B delta compared to the 65B model.

I would suspect they're probably conditioning data for GPT-5. I'm guessing ‘training’ presupposes they have the training data primed, and getting the data into shape seems to be one of the main cruxes.