
Comment by botw44

8 hours ago

The whole thesis falls apart though. You can't be on your way to "power over everything" and get distilled into free Chinese models within months. Pick one.

The bottleneck is compute and data, not the model. That's why they could only gate it for a bit. The ITAR thing proves it: no nationality controls in place, so the only option was killing the whole thing. Not exactly what an all-powerful gatekeeper does.

> The whole thesis falls apart though. You can't be on your way to "power over everything" and get distilled into free Chinese models within months. Pick one.

But is that last part actually true? Sure, there might be 600B+ models available for download and local inference if you have the hardware, but do the users who use Anthropic switch over to those, even when they're available as hosted models? Seems like some do, most don't. Anthropic and Claude remain very popular among the people who use LLMs; there is no denying that.

  • > do the users who use Anthropic switch over to those, even when they're available as hosted models?

    I'm currently spending $200 for Claude. That's around the maximum I can afford. I could stretch that to $500, I guess. But I saw reports of people spending tens of thousands of dollars on the Claude API. That's certainly outside of my budget.

    So if/when Anthropic decides to stop subsidizing subscriptions (if they ever do; I'm still not sure about that), I'll certainly look at the other options. And available "open weights" LLMs hosted by someone will be my first pick. Right now Claude 4.8 feels very advanced, but things move very fast...

    • The AI labs would be very dumb to get rid of subscriptions. First, I don’t even think the subscriptions are losing money; I suspect they’re around break-even, maybe with small losses. More importantly, the subscriptions are how they lock in users and convince companies to pay API rates. Without the user loyalty that they cultivate with subscriptions, businesses will just use the cheapest model on OpenRouter or maybe local models.


  • People don't pivot on a dime. If major model improvements stopped for a few years while equivalent free models were available, we would see people slowly move over to competitors.

  • The hotness we are seeing is smaller 'expert' models with an 'orchestrator' model in front that evaluates the prompts, routes them to the appropriate small models, and then synthesizes the collected answer. That is easier to split across many smaller, cheaper servers and more efficient than a huge monolithic model.

    • Do you have more info about this? I can't tell if you're being misled by the unfortunate "Mixture of Experts" terminology (which doesn't work the way you're describing), or alluding to something different.

      Or, maybe I'm wrong, but my understanding is: MoE is just an architecture to keep the activated weights smaller per token. The experts get routed basically token-by-token, and the "experts" themselves don't have a semantic domain so the "expert" word was maybe a poor choice.
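To make the token-by-token routing described above concrete, here is a minimal sketch of top-k gating for a single MoE layer. Everything in it is an assumption for illustration: the names (`route_token`, `router_weights`), the sizes, and the random "learned" weights are hypothetical, and real implementations differ in details such as load balancing. The point is only that each token position picks its own experts from a router score, with no notion of a semantic domain per expert.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token_vec, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for one token (hypothetical helper)."""
    # One router score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    # Keep only the top_k experts; the rest are not evaluated for this token.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize the surviving scores into gate weights that sum to 1.
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))

random.seed(0)
DIM, NUM_EXPERTS = 4, 8  # illustrative sizes only
# Stand-in for a learned router matrix (random here, trained in a real model).
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Five token embeddings from "one prompt"; each is routed independently.
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

routes = [route_token(t, router_weights) for t in tokens]
for position, chosen in enumerate(routes):
    print(position, [(i, round(g, 2)) for i, g in chosen])
```

Consecutive tokens of the same prompt can land on different experts, which is why only a fraction of the weights is active per token while the "experts" remain interchangeable sub-networks rather than domain specialists.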


  • > Anthropic and Claude remain very popular among the people who use LLMs

    Only because someone else is paying the bills. I use Claude Opus at work because my employer pays for the tokens and encourages me to do it.

    At home, I use DeepSeek Flash. It's not as good, but it's maybe 0.7 quality for 0.001 cost.

    • Same, I had Deepseek search for, download and transfer (to my Linux emulation machine) the best Dreamcast games yesterday.

      GPT refused to do so (citing that it's illegal even though I own the games). Deepseek did a wonderful job for 7 cents.

      At work I use Opus because, why not? But I could easily switch to a less capable model if needed.


    • I have a question that perhaps you or someone else here has an answer for: I enjoy using Opus via Google Antigravity (usually agy) for perhaps 90 minutes a week. For Google’s subsidized $20/month plan they seem to give out a reasonably generous amount of Claude tokens. How does this compare with Anthropic’s $20/month plan using Claude Code?

      BTW, I also use DeepSeek v4 Flash very frequently: fast and so cheap it is almost free.


  • I don't think you're appropriately understanding the full gamut. The individuals who only spend $200/month will be stuck. But the pie is increasing in size; it's not stagnant. There are a lot of orgs who can afford to run a 1T model and even more that can run a 600B model. These newcomers are what's being fought over.

I disagree. It is not the model alone. It needs a system which capitalizes on it. And this is very complex. Hardware, software, architecture - it takes a lot to get it right.

Try running the latest OS models on a normal Mac or PC. Claude Fable and Mythos are systems not just pure models.

And of course marketing. Don't believe the hype.

I think Claude is oftentimes underwhelming. Security is also something companies have a blind spot for. The toughest pro-security (yes, pro! Totally different framing!) company I know is Google, after all.

What I can advise companies to do is to have more than just bug bounties: a professional hacker team that does nothing but attack them, day and night, 24/7. This needs to be coordinated with the government; otherwise you might set off an alarm and get SWATed for doing good. And I would pay them huge sums, not the standard wage, since the risk and fallout warrant such treatment.

Hackers are the real deal, not AI. Proof: Hackers using AI.

  • > Try running the latest OS models on a normal Mac or PC.

    It can be done through the magic of SSD offload. The worst case involves seconds-per-token speeds, but that's OK if you only care about low volumes of slow unattended inference, which maximizes utilization for the hardware.

    (The real worst case, where you're streaming the whole model from the cheapest storage you could feasibly think of, involves multiple minutes per token for a single inference, or even hours per token batch if you're doing many inferences in bulk. That's a lot less helpful, so there's a space for smaller models at the edge, even for unattended workloads.)
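The seconds-versus-minutes gap above follows from simple arithmetic: if weights do not fit in memory, each generated token costs roughly "bytes of weights read" divided by "storage bandwidth". The sketch below is a back-of-the-envelope estimate under loudly assumed numbers: a 600B-parameter model at 4 bits per weight, 3 GB/s for a fast SSD, 0.15 GB/s for cheap storage, and 5% of weights touched per token for a sparse model. None of these figures come from the thread beyond the 600B size, and the estimate ignores compute time and any RAM caching.

```python
def seconds_per_token(params, bytes_per_param, read_fraction, bandwidth_bytes_per_s):
    """Rough lower bound: time to stream the weights one token needs from storage."""
    bytes_read = params * bytes_per_param * read_fraction
    return bytes_read / bandwidth_bytes_per_s

PARAMS = 600e9          # assumed model size (parameters)
BYTES_PER_PARAM = 0.5   # assumed 4-bit quantization
FAST_SSD = 3e9          # assumed bytes/second for an NVMe drive
CHEAP_STORAGE = 0.15e9  # assumed bytes/second for slow, cheap storage

scenarios = {
    # Sparse model: only an assumed 5% of weights is read per token.
    "fast SSD, sparse reads": seconds_per_token(PARAMS, BYTES_PER_PARAM, 0.05, FAST_SSD),
    # Whole model streamed for every token.
    "fast SSD, whole model": seconds_per_token(PARAMS, BYTES_PER_PARAM, 1.0, FAST_SSD),
    "cheap storage, whole model": seconds_per_token(PARAMS, BYTES_PER_PARAM, 1.0, CHEAP_STORAGE),
}

for name, seconds in scenarios.items():
    print(f"{name}: {seconds:.0f} s/token ({seconds / 60:.1f} min)")
```

Under these assumptions the three cases span seconds, a couple of minutes, and tens of minutes per token. Batching many sequences amortizes one pass over the weights, which helps throughput but not the latency of any single answer.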

  • > I disagree. It is not the model alone. It needs a system which capitalizes on it. And this is very complex.

    AFAICT … despite saying you “disagree”, you appear to be agreeing with the parent comment that the model is less important and compute (all that complex infra) and data (also complex infra) are more important.

  • An LLM which provides an OpenAI or Anthropic API-compatible interface + a coding harness like OpenCode or oh-my-pi is a pretty easy "ecosystem" to replicate. Exactly what makes you say Fable or Mythos are "systems, not just pure models"?

    • Fable can delegate tasks to Opus or Sonnet, so it has some agentic properties and I believe it does them in parallel.

      The parallelism is where this starts to fall apart on a local PC. Like I can run some Qwen quants, but I can’t run a decent Qwen model while also running another model smart enough to actually implement it. I’d have to do them in series, and given how long Fable seems to take even with parallelism, I’d probably be waiting days for an answer.


  • > > The bottleneck is compute and data, not the model.

    > I disagree. It is not the model alone. It needs a system which capitalizes on it. And this is very complex. Hardware, software, architecture - it takes a lot to get it right.

    What do you disagree with exactly?

  • For now I suspect however that the gigantic models are not needed and you will be able to do pretty much what you need in a specific domain with 120b or lower. There is so much trash in the frontier models. I don't need all the world's slam poetry for my coding tasks for example.

    • Wrong, mostly.

      Model capability is a function of model size. Raising the bar raises model performance in every domain.

      An "idiot savant" model that's overtrained for a specific domain would beat a generalist model of the same size. But scale the generalist up enough, and it'll trounce the specialist. Removing poetry data from a model's training mix doesn't give you much (it might even cost you some performance), and the "idiot savant" approach of overtraining for a domain has a hard ceiling.

      So far, it seems like there's some equivalent of "g factor" in LLMs - a broad "intelligence" value that performance across many diverse domains correlates with. And, as a rule, larger models have more of it.


"Distillation" from APIs is not a thing; it cannot replicate a model's deep reasoning and behavior.

  • I struggle with the practicality of the whole thing.

    The amount of tokens required to properly distill a frontier model is so large that by the time you could consume the # of tokens you would either be banned for extremely obvious abuse or a new model would be released, rendering your efforts less and less valuable over time. Intelligence is not a linear thing. Being behind just a little bit can have exponential consequences.

    • > Being behind just a little bit can have exponential consequences.

      That seems to be the argument of Dario, Sam et al., but I'm not ready to believe it. Time will tell, but this could be a marathon, and Anthropic and OpenAI are getting ready to sprint the last lap of the first mile.

  • I'm uneducated on how distillation works at more than a basic level so forgive me if this is a stupid question.

    Isn't "distillation" of another provider's model exactly how these models got training data in the first place: massive amounts of the written word + Prompt -> Answer? Why wouldn't distillation produce similar "reasoning" in the new model? It's just inputs and outputs.

    • What you're describing is (pre-)training. Distillation requires richer labels, the probability distribution over tokens (it would be logits rather than probabilities but that's not important). From a chat transcript you can only understand the argmax/most likely token of that distribution (and only if the API allows you to set the temperature to 0). It's not impossible for an API to give you that but they won't if they don't want you distilling their models.

      The intuition is that distillation exploits not only the "right" answer but the relationship between answers (what's the second most right answer? the third? etc).
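A small numeric sketch of that intuition, under assumed toy numbers: one teacher distribution over a four-token vocabulary and two hypothetical students. Both students give the teacher's top token the same probability, so a loss built only from the argmax (what a transcript reveals) cannot tell them apart, while a loss against the full distribution can. The logits and helper names are made up for illustration; this is not any lab's training recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): the soft-label distillation loss for one token position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(teacher_probs, student_probs):
    """Cross-entropy against only the teacher's most likely token."""
    target = max(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    return -math.log(student_probs[target])

# Toy teacher: token 0 is best, token 1 a close second, token 3 very unlikely.
teacher = softmax([4.0, 3.5, 0.5, -2.0])
# Student A reproduces the teacher's full ranking.
student_a = softmax([4.0, 3.5, 0.5, -2.0])
# Student B agrees on the top token but swaps the runner-up with the worst token.
student_b = softmax([4.0, -2.0, 0.5, 3.5])

hard_a = hard_label_loss(teacher, student_a)
hard_b = hard_label_loss(teacher, student_b)
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)

print("hard-label loss:", round(hard_a, 4), round(hard_b, 4))
print("soft-label loss:", round(soft_a, 4), round(soft_b, 4))
```

The hard-label losses match, while the soft-label loss is zero for student A and large for student B. That difference is the extra signal an API withholds when it returns only sampled text.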

    • Among other things, because you simply can't get those "massive amounts" of text from a SOTA model at reasonable cost. And complex reasoning cannot possibly be trained in a pure one-shot fashion, real post-training takes massive resources. The whole story doesn't add up.

> no nationality controls in place

Not for now, but how long before we have KYC regulations concerning LLMs?

  • That’s really what Dario wants. Let’s hope he doesn’t get it

    • what Dario wants is to retain any influence whatsoever on how the research progresses before the inevitable nationalization of the frontier. he gets to keep the N-2 tech and maybe influence the N-1 tech, but the only influence on the frontier he has is today; whatever he imprints in the pipeline the government takes over.

      IOW I don't think he thinks in the same categories as most folks here.


    • But he already got it, no? Claude Fable can only be made available to US citizens, which implies that every user who wants to use Claude Fable must provide proof of citizenship in some way, basically KYC.


    • Regulatory capture is the OpenAI and Anthropic end goal, for certain.

      But I also think they exist in a sort of un-designed corporate narcissism, which is a common trait in bubble economies — I am not judging them particularly severely.

      Netscape under Clark and Andreessen and Sun under McNealy both fell into corporate narcissism: the belief that only they really mattered, that they were chosen, and that the world needed to rearrange itself to just let them shine. They arguably let themselves get played by Oracle (a corporate psychopath) and others as a result.

      OpenAI's position is profoundly corporate-narcissistic: all we need is all the money in the economy and not to have to do anything upsetting like think about turning a profit for the next four years. Like rich kids. It would be nice if you believed we were so important that we should get an enormous stipend for just being us.

      Anthropic's position is: we think we're so unique and ominous that government needs to make us both essential and terrifying. We have to exist otherwise worse people will.

      Both narcissistic positions.


That thesis is not about what Anthropic will achieve, but about what power they think they ought to have.

That's a different problem than what you're arguing against.

To this point, I've never understood the supposed "alignment" between the EA/AI Safety crowd and Anthropic's mission that the author comments on. Be the stewards of the Machine God, but responsibly? I think the Manhattan project, which AI development is commonly analogized to, had a lot more intrinsic properties to gate against uncontrolled proliferation (which still happened to some extent). Also this is a company that is expected to go public this year, at which point there will be a slew of new voices pushing the company to increase its value, mission be damned.

People like Yud at least have a clear consistency in their advocacy that we shouldn't be developing this at all. Anyone who thinks they can reconcile Anthropic's work with the AI safety mission is in total fantasyland, if it's not just a public persona they've adopted strategically.

The distilled versions miss the spark of the model. It's like they land in the uncanny valley of models.

  • They get to 80% of the top models for 10x cheaper. Unless you don't care about the money at all, that's hard to ignore.