Comment by embedding-shape

3 days ago

> Framework is ready. Now we need someone to actually train the model.

If Microslop aren't gonna train the model themselves to prove their own thesis, why would others? They've had 2 years (I think?) to prove BitNet in at least some way, are you really saying they haven't tried so far?

Personally, that makes me slightly wary of just taking what they say at face value. Why wouldn't they train and publish a model themselves if this actually led to worthwhile results?

Because this is Microsoft: experimenting and failing is not encouraged; taking less risky bets and getting promoted is. Also, no customer asked them for a 1-bit model, hence the PM didn't prioritize it.

But that doesn't mean the idea is worthless.

You could have said the same about Transformers: Google released them but didn't move forward, and it turned out to be a great idea.

  • > You could have said same about Transformers, Google released it, but didn't move forward,

    I don't think you can. Google looked at the research results and continued researching Transformers and related technologies, because they saw the value, particularly for translation. The direction to take is part of the original paper; give it a read, it's relatively approachable for a machine learning paper :)

    Sure, it took OpenAI to turn it into an "assistant" that answered questions, but it's not like Google was completely sleeping on the Transformer; they just had other research directions to go into first.

    > But it doesn't mean, idea is worthless.

    I agree, they aren't; hope that wasn't what my message read as :) But ideas that don't actually pan out in reality are slightly less useful than ideas that do pan out once put into practice. The root commenter seems to be saying "This is a great idea, it's all ready, the only missing piece is for someone to do the training and it'll pan out!", which I'm a bit skeptical about, since it's been two years since they introduced the idea.

    • What OpenAI did was train increasingly large transformer models, which was sensible because transformers allowed training to scale up compared to earlier architectures. The resulting models (GPT) showed good understanding of natural-language syntax and generated mostly sensible text (which was unprecedented at the time), so they made ChatGPT by adding supervised fine-tuning and RLHF stages on top of their pretrained text-prediction models.

      2 replies →

    • Google had been working on a big LLM but they wanted to resolve all the safety concerns before releasing it. It was only when OpenAI went "YOLO! Check this out!" that Google then internally said, "Damn the safety concerns, full speed ahead!" and now we find ourselves in this breakneck race in which all safety concerns have been sidelined.

      2 replies →

    • On the one hand, not publishing any new models for an architecture in almost a year seems like forever, given how fast things are moving right now. On the other hand, I don't think that's very conclusive on whether they've given up on it or simply have other, higher-priority research directions to pursue first.

  • > You could have said same about Transformers, Google released it, but didn't move forward, turns out it was a great idea

    Google released Transformers as research because they invented them while improving Google Translate. They had been running the technology for customers for years.

    Beyond that, they had publicly used transformer-based LMs (MUM) integrated into Search before GPT-3 (pre-chat mode) was even trained. They were shipping transformer models that generated text for years before the ChatGPT moment. Being live on the Google SERP is probably the widest deployment a technology can have today.

    Transformers are also used widely in ASR technologies, like Google Assistant, which of course was available to hundreds of millions of users.

    Finally, they had experimental LLMs available privately to employees, as well as various released research initiatives (Meena, LaMDA, PaLM, BERT, etc.) and other experiments; they just didn't productize everything (but see the earlier points). They even experimented with scaling (see the "Chinchilla scaling laws").

The most benign answer would be that they don’t want to further support an emerging competitor to OpenAI, which they have significant business ties to. I think the more likely answer which you hinted at is that the utility of the model falls apart as scale increases. They see the approach as a dead end so they are throwing the scraps out to the stray dogs.

  • Not to mention Microsoft's investments in Nvidia and other GPU-adjacent/dependent companies!

    A successful ternary model would basically erase all that value overnight. In fact, the entire stock market could crash!

    Think about it: This is Microsoft we're talking about! They're a convicted monopolist with a history of manipulating the market for IT goods and services. I wouldn't put it past them to refuse to invest in training a ternary model, or even to buy up ternary startups just to shut them down.

    Want to make some easy money? Start a business training a ternary model and make an offer to Microsoft. I bet they'll buy you out for at least a few million even if you don't have a product yet!

    • If that were true then they simply wouldn’t have published this research to begin with.

      Occam’s Razor suggests this simply doesn’t yield results as good as the status quo.

Rest assured, all the big players (OpenAI, Google, DeepSeek, etc.) have run countless experiments with 4-, 3-, 2-, 1.58-, and 1-bit weights, and various sparsity factors and shapes. This barrel has been scraped to the bottom.
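For context on the "1.58" figure: a ternary weight in {-1, 0, +1} carries log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" name comes from. A minimal sketch of the counting argument and of packing ternary weights (the `pack5`/`unpack5` helpers are illustrative, not from any BitNet code):

```python
import math

# A ternary weight has 3 possible values, so its information
# content is log2(3) bits -- the "1.58" in "1.58-bit".
print(math.log2(3))  # ~1.585

# Since 3**5 = 243 <= 256, five ternary weights fit in one byte.
def pack5(ws):
    """Pack five ternary weights (each -1, 0, or +1) into one byte."""
    b = 0
    for w in ws:
        b = b * 3 + (w + 1)  # map {-1,0,1} -> {0,1,2}, base-3 encode
    return b

def unpack5(b):
    """Inverse of pack5: recover the five ternary weights."""
    ws = []
    for _ in range(5):
        ws.append(b % 3 - 1)
        b //= 3
    return ws[::-1]

print(unpack5(pack5([-1, 0, 1, 1, -1])))  # [-1, 0, 1, 1, -1]
```

That packing gives 1.6 bits per weight in practice, very close to the 1.58-bit information-theoretic floor.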

  • I have doubts about this. Perhaps the closed models have, but I wouldn't be so sure for the open ones.

    GLM 5, for example, runs 16-bit weights natively. That makes their 755B-parameter model ~1.5 TB in size, and their 40B active parameters ~80 GB.

    Compare this to Kimi K2.5: a 1T-parameter model, but with 4-bit weights (int4), which makes the model ~560 GB, and its 32B active parameters ~16 GB.

    Sure, GLM 5 is the stronger model, but is that worth paying for with 2-3x longer generation times and 2-3x more memory?

    I think this barrel's bottom really hasn't been scraped.
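The sizes in the comparison above follow directly from parameters × bits per weight. A rough sanity check (raw weight storage only, ignoring KV cache and packing overhead; the `weight_gb` helper is illustrative):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate raw weight storage in GB: billions of params * bytes each.
    Ignores KV cache, activations, and format overhead."""
    return params_billions * bits_per_weight / 8

# Figures quoted above:
print(weight_gb(755, 16))   # 1510.0 GB -> ~1.5 TB (GLM 5, 16-bit)
print(weight_gb(40, 16))    # 80.0 GB   -> GLM 5 active parameters
print(weight_gb(1000, 4))   # 500.0 GB  -> Kimi K2.5 (~560 GB with overhead)
print(weight_gb(32, 4))     # 16.0 GB   -> Kimi K2.5 active parameters
```

By the same arithmetic, a hypothetical 1T-parameter ternary model at ~1.6 bits per weight would need only ~200 GB, which is why the question of whether anyone has seriously trained one keeps coming up.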