Comment by jbellis

10 hours ago

I chased down what the "4x faster at AI tasks" was measuring:

> Testing conducted by Apple in January 2026 using preproduction 13-inch and 15-inch MacBook Air systems with Apple M5, 10-core CPU, 10-core GPU, 32GB of unified memory, and 4TB SSD, and production 13-inch and 15-inch MacBook Air systems with Apple M4, 10-core CPU, 10-core GPU, 32GB of unified memory, and 2TB SSD. Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization, and LM Studio 0.4.1 (Build 1). Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.

48 comments

butILoveLife 7 hours ago

>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization

Oh dear, 14B and 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company lol)

bbshfishe 32 minutes ago

Yeah no it didn’t. If you have a fully specced out M3/4 MacBook with enough memory you’re running pretty decent models locally already. But no one is using local models anyway.
knicholes 4 hours ago
I wonder if Apple has foresight into locally running LLMs becoming sufficiently useful.
- DiscourseFan 1 hour ago
  
  It won’t handle serious tasks but I have Gemma 3 installed on my M2 Mac and it is good for most of my needs, especially data I don’t want a corporation getting its hands on.
  
- b112 1 hour ago
  
  They do! "You're holding it wrong"

gslepak 6 hours ago

That is talking about battery life, not AI tasks. Footnote 53, where it says, "Up to 18 hours battery life":

https://www.apple.com/macbook-pro/

whynotmaybe 8 hours ago

Quite interesting that it's now a selling point just like fps in Crysis was a long time ago.

re-thc 7 hours ago
Next is the fps of an AI playing Crysis.
- dana321 7 hours ago
  
  Or tasks per minute of the AI doing your job for you
  

fulafel 7 hours ago

So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

dotancohen 5 hours ago
For many workflows involving real-time human interaction, such as voice assistants, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response quality threshold has been achieved, as the software planning and writing tasks that most HN readers are likely familiar with.
- raw_anon_1111 2 hours ago
  
  The way that voice assistants work, even in the age of LLMs, is:
  Voice -> Speech to Text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.
  TTFT is irrelevant; you have to process everything through the pipeline before you can generate a response. A fast model is more important than a good model.
  Source: I do this kind of stuff for call centers. Yes I know modern LLMs don’t go through the voice -> text -> LLM -> text -> voice anymore. But that only works when you don’t have to call external sources
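
The pipeline described in this reply can be sketched as a chain of strictly sequential stages. This is only a rough illustration: the `Stage` type, the stage names, and every latency and token count below are hypothetical placeholders, not measurements of any real system. It shows why, when the first LLM call has to emit complete JSON before the API call can start, the full generation time of each call counts rather than just its time to first token.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    seconds: float  # hypothetical latency, not a measurement


def llm_call_seconds(ttft: float, output_tokens: int, tokens_per_second: float) -> float:
    """Time until an LLM call has produced its complete output."""
    return ttft + output_tokens / tokens_per_second


def end_to_end_seconds(stages: list[Stage]) -> float:
    """Latency when every stage must finish before the next one starts."""
    return sum(stage.seconds for stage in stages)


TTFT = 0.2  # seconds, hypothetical
intent_call = llm_call_seconds(TTFT, output_tokens=40, tokens_per_second=50.0)
reply_call = llm_call_seconds(TTFT, output_tokens=60, tokens_per_second=50.0)

pipeline = [
    Stage("speech to text", 0.3),
    Stage("LLM: determine intent, emit JSON", intent_call),
    Stage("API call", 0.4),
    Stage("LLM: write response", reply_call),
    Stage("text to speech", 0.3),
]

total = end_to_end_seconds(pipeline)
ttft_share = 2 * TTFT / total  # two LLM calls, each paying TTFT once

print(f"end to end: {total:.2f} s, of which TTFT is {ttft_share:.0%}")
```

Under these made-up numbers, time to first token is a small slice of the total, which is the point the reply makes. A design that streams the final LLM output straight into text to speech would shift the balance back toward TTFT.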
Art9681 33 minutes ago

It's going to be faster no matter what. My M3 Max prints tokens faster than I can read for the new MoE models. It's the prompt processing that kills it when the context grows beyond a threshold, which is easy to do in modern agentic loops.

easygenes 3 hours ago

Topical. My hobby project this week (0) has been hyper-optimizing microgpt for M5's CPU cores (and comparing to MLX performance). Wonder if anything changes under the regime I've been chasing with these new chips.

0: https://entrpi.github.io/eemicrogpt/

gok 3 hours ago

consider using fp16 or bf16 for the matrix math (in SME you can use svmopa_za16_f16_m or svmopa_za16_bf16_m)

lastdong 8 hours ago

14-billion parameter model with 4-bit quantization seems rather small

derefr 5 hours ago

I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.
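
As a rough check on the headroom argument above, the weight footprint of a quantized model can be estimated from its parameter count and bits per weight. This is a minimal sketch under loud assumptions: it counts weights only (no KV cache, activations, or quantization metadata), uses decimal gigabytes, and the function names and the 50% budget are illustrative, taken from the comment's guess rather than from anything Apple has stated.

```python
GB = 1e9  # decimal gigabytes, an assumption for simplicity


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone."""
    return params * bits_per_weight / 8.0


def fits_in_budget(params: float, bits_per_weight: float,
                   ram_bytes: float, budget_fraction: float) -> bool:
    """True if the weights fit inside the given fraction of RAM."""
    return weight_bytes(params, bits_per_weight) <= ram_bytes * budget_fraction


# 14B parameters at 4 bits, on the 32GB machines from Apple's footnote
size_14b = weight_bytes(14e9, 4)
fits_14b = fits_in_budget(14e9, 4, ram_bytes=32 * GB, budget_fraction=0.5)

# A 70B model at 4 bits on the same machine, for contrast
fits_70b = fits_in_budget(70e9, 4, ram_bytes=32 * GB, budget_fraction=0.5)

print(f"14B @ 4-bit weights: {size_14b / GB:.1f} GB, fits in 50% of 32 GB: {fits_14b}")
print(f"70B @ 4-bit fits in 50% of 32 GB: {fits_70b}")
```

Real usage will be higher than this estimate once the KV cache for a long prompt is included, so the result is best read as a lower bound on memory needed.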
simlevesque 8 hours ago

It's not much for a frontier AI but it can be a very useful specialized LLM.
giancarlostoro 7 hours ago
On my 24GB RAM M4 Pro MBP some models run very quickly through LM Studio to Zed; I was able to ask it to write some code. Course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine what it's like on a more serious setup like the Mac Studio.
- kraig911 11 minutes ago
  
  what model were you using?
- efxhoy 6 hours ago
  
  How is the output quality of the smaller models?
  
butILoveLife 7 hours ago
For anyone who has been watching Apple since the iPod commercials, Apple really, really has a grey area in the honesty of their marketing.
And not even diehard Apple fanboys deny this.
I genuinely feel bad for people who fall for their marketing thinking they will run LLMs. Oh well, I got scammed on runescape as a child when someone said they could trim my armor... Everyone needs to learn.
- zitterbewegung 7 hours ago
  
  Yesterday I ran qwen3.5:27b with an M1 Max and 64 GB of RAM. I have even run Llama 70B when llama.cpp came out. These run sufficiently well, if somewhat slowly, but the improvements with the M5 Max will make it a much faster experience.
- mptest 2 hours ago
  
  In retrospect, was there a better place to learn about the cruelty of the world than runescape? Must've got scammed thrice before I lost the youthful light in my eye
- giwook 7 hours ago
  
  I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.
  There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.
  Reminds me of the saying: "A fool and his money are soon parted".
- nine_k 3 hours ago
  
  There used to be a polite way to call this out, the "Steve Jobs's reality distortion field".
  
bilbo0s 7 hours ago

It is.
That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.

azinman2 10 hours ago

Seems very reasonable to me

tux3 10 hours ago
A bit strange to use time to first token instead of throughput.
Latency to the first token is not like a web page where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput)
- jbellis 10 hours ago
  
  As far as benchmarketing goes they clearly went with prefill because it's much easier for Apple to improve prefill numbers (flops-dominated) than decode (bandwidth-dominated, at least for local inference); M5 unified memory bandwidth is only about 10% better than the M4.
- GeekyBear 10 hours ago
  
  In previous generations, throughput was excellent for an integrated GPU, but the time to first token was lacking.
  
- hedgehog 4 hours ago
  
  Not strange: for the kind of applications models at that size are often used for, the prefill is the main factor in responsiveness. Large prompt, small completion.
- case540 10 hours ago
  
  I assume it’s time to first output token so it’s basically throughput. How fast can it output 8001 tokens?
- fragmede 10 hours ago
  
  No you don't. Not as a sticky mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, that doesn't actually remotely do a damned thing, but it gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.
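
The prefill versus decode distinction argued over in this subthread can be put into a back-of-the-envelope model. This is a sketch, not a benchmark: it assumes a dense transformer needs about 2 FLOPs per parameter per prompt token (attention cost ignored), that decode reads all weights once per generated token, and that "8K" means 8192 tokens. The compute and bandwidth figures are hypothetical placeholders; the 4x compute ratio comes from Apple's headline and the 10% bandwidth ratio is taken at face value from the comment above, not verified.

```python
def prefill_seconds(params: float, prompt_tokens: int, flops_per_second: float) -> float:
    """Compute-bound estimate of time to first token."""
    return 2.0 * params * prompt_tokens / flops_per_second


def decode_tokens_per_second(params: float, bits_per_weight: float,
                             bytes_per_second: float) -> float:
    """Bandwidth-bound upper limit on generation speed."""
    weight_bytes = params * bits_per_weight / 8.0
    return bytes_per_second / weight_bytes


def total_seconds(ttft: float, output_tokens: int, tokens_per_second: float) -> float:
    """Time to first token plus time to generate the rest of the response."""
    return ttft + output_tokens / tokens_per_second


PARAMS, BITS, PROMPT, OUTPUT = 14e9, 4, 8192, 500

# Hypothetical "old" and "new" chips: 4x the compute, 1.1x the bandwidth
old_flops, new_flops = 10e12, 40e12
old_bw, new_bw = 100e9, 110e9

ttft_old = prefill_seconds(PARAMS, PROMPT, old_flops)
ttft_new = prefill_seconds(PARAMS, PROMPT, new_flops)
tps_old = decode_tokens_per_second(PARAMS, BITS, old_bw)
tps_new = decode_tokens_per_second(PARAMS, BITS, new_bw)

speedup = (total_seconds(ttft_old, OUTPUT, tps_old)
           / total_seconds(ttft_new, OUTPUT, tps_new))

print(f"TTFT speedup:       {ttft_old / ttft_new:.2f}x")
print(f"decode speedup:     {tps_new / tps_old:.2f}x")
print(f"end-to-end speedup: {speedup:.2f}x for {OUTPUT} output tokens")
```

With these placeholder inputs, a 4x improvement in time to first token translates into a much smaller end-to-end gain once a few hundred output tokens are generated, and the gap narrows further as the completion gets longer. For the "large prompt, small completion" workloads mentioned above, the end-to-end figure moves closer to the prefill speedup.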
  
nabakin 10 hours ago

I would consider it reasonable if this was 4x TTFT and Throughput, but it seems like it's only for TTFT.

Havoc 4 hours ago

Does that include loading the model again? Apple seems to be the only company doing such shenanigans in their measurements

nullbyte808 3 hours ago

Like saying my PC boots up 2x faster so it must be 2x more powerful. lol