This post is a classic! Also recommended: Horace also gave a related talk (covering the high-level picture of modern ML Systems) at Jane Street in Dec 2024 https://www.youtube.com/watch?v=139UPjoq7Kw
One thing people seems not to acknowledge, and this post made it super clear is that NVIDIA kept their lead extremely well in a few years of very high growth. The TFLOPs, the bandwidth, the interconnect mentioned in this post continues to grow at exponential rate with no sign of stopping yet. This is a 30-year-old incumbent reminding you. The willingness to compete from NVIDIA is just simply remarkable.
the real lesson: GPUs win on memory bandwidth not just FLOPs. batching ops keeps VRAM fed at 2TB/s instead of tripping to RAM at 50GB/s for every operation
Why are we comparing a programing language and a GPU. This is a category error. Programing languages do not do any operations. They perform no FLOPs, they are the thing the FLOPs are performing.
"The I7-4770K and preform 20k more Flops than C++" is an equally sensible statement (i.e. not)
> Why are we comparing a programing language and a GPU.
You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.
When the author says it's millions of flops faster in a gpu than in an interpreteted programming language, it's not comparing them directly, but algorithms that run in them, so the substitution is the algorithms for the tools used to implement/run them.
It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of flops slower then on the GPU". There is no category error there.
the sentence is ambiguous because "Python" can mean python + a certain library and even a different Python implementation
but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup
in this case NVIDIA A100 vs "Python" that does not reach a A100 (without the help of CUDA and PyTorch)
another analogy:
I find it useful to be able to compare what the fastest known way is to move a container from A to B using a certain vehicle (e.g. truck) and how that compares to how fast a person that can not drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)
I'm also interested in how much energy is needed, how much the hw costs and so on
Often there are many ways to do things, comparing is a great starting point for learning more
Okay, but surely you know what they actually mean right, or are you being willfully obtuse? They are comparing CPython (the main python implementation)'s implementation that runs on the CPU with a kernel running on the GPU.
yes of course this is apples to oranges but that's kind of the point
it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU
the interesting thing is why that is so
CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
I was researching if there was much benefit to using Rust or C++ over Python for AI, and turns out, the GPU doesn't care once the instructions are in because its an entirely different spec running on the GPU. The only thing you might save on is "startup" costs of getting your code into the GPU I guess? I assume that time cost is miniscule though, once its all in memory, nobody cares that you spent any time "booting it up" any more than how long Windows takes these days.
Not really. GPU many cores, at least for fp32, gives you 2 to 4 order of magnitudes compared to high speed CPU.
The rest will be from "python float" (e.g. not from numpy) to C, which gives you already 2 to 3 order of magnitude difference, and then another 2 to 3 from plan C to optimized SIMD.
I feel like there is no portable advice for performance. A torch model exported as onnx is a different model.
That onnx model run using onnxruntime with cuda ep is a different model than the one run with TRT ep.
And even among the same runtime, depending on the target hardware and the memory available during tuning, the model behaves differently. It is a humongous mess
That's interesting as I was considering GGUF --> ONNX conversions (via Olive), but if this creates unknown distortions in the effectiveness and stability, it might be a dead-end idea.
How does x.cos().cos() work faster than doing two cos calls separately? Like the first cos call returns a tensor either way, the only difference is that it's not assigned to a variable. But how is it even possible know that difference in python?
It’s really not a concept you can express in idiomatic Python very easily. This comes from the actual generated assembly involving copies from global GPU memory into registers (slow, bandwidth saturates quickly) and back in between the cosines. If you can avoid the intermediate roundtrip that cuts the cost approximately in half.
Yeah, that part should not be read literally; `x.cos().cos()` and `x1 = x.cos(); x2 = x1.cos()` both launch the same number of kernels (two in unfused/eager mode, one in fused/torch.compile, see this test notebook [1]). I think the author chained the two cos calls to symbolize the idea of combining them (without exposing the intermediate result), but chaining the two cos calls doesn't literally trigger operator fusion.
If you want a written resource I have a blog post about the mathematics behind building a feed forward from scratch, https://max-amb.github.io/blog/the_maths_behind_the_mlp/. Kinda focuses on translation from individual components to matrix operations.
It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention then maybe a small inference engine. Kaparthy’s llama2.c is only ~300 lines on the latter and it pragma simds so you don’t need fancy GPUs
Deep learning is just glorified linear algebra. Master the progression: Feed-forward CNN RNN LSTM Attention. You don't even need a GPU to understand the climax; Karpathy’s llama2.c implements a full transformer inference engine in just ~300 lines of C using SIMD pragmas for CPU execution.
I wish more people pursued that approach to teaching neural networks.
First teach what the network does and why, writing it as a loopy, inference-only Python function. Explain training only in an abstract way, E.G. with the "take a random weight, twist it a little and see if the loss improves" algorithm. This lets you focus on the architecture and on why it is what it is.
Then, teach the intuitions behind derivatives and gradient descent. You don't need the entirety of calculus, there's no benefit to knowing how a sequence or limit works if you ) only want to understand neural networks. With autograd, you won't be manually doing derivatives of weird functions either, so intuitive understanding is a lot more important than doing dozens of traditional calculus exercises on paper like it's the 1800s. You could probably explain the little bit of calculus you need in an hour or two, even to somebody with a 12-year-old's understanding of math and a good bit of programming knowledge.
Only when people understand the training and inference, implemented with loops and descriptive variable names, teach the tensor, explain how a modern CPU and GPU works (because many programmers still think a modern computer is just a much faster 6502), and then teach the tricks we use to make it fast.
I just assume that people who are going to do useful things in ML have basic foundation in math and science. If you don’t know what a derivative is what are we doing talking about multi-variable optimization.
And it’s not about gate-keeping it’s really about being able to reason about these concepts. What this looks like in programming is people memorizing a million clean code rules and not being able to write binary search.
>For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.
Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment, a few words to indicate what you’re actually trying to say would go a long way.
>We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
This post is a classic! Also recommended: Horace also gave a related talk (covering the high-level picture of modern ML Systems) at Jane Street in Dec 2024 https://www.youtube.com/watch?v=139UPjoq7Kw
One thing people seems not to acknowledge, and this post made it super clear is that NVIDIA kept their lead extremely well in a few years of very high growth. The TFLOPs, the bandwidth, the interconnect mentioned in this post continues to grow at exponential rate with no sign of stopping yet. This is a 30-year-old incumbent reminding you. The willingness to compete from NVIDIA is just simply remarkable.
the real lesson: GPUs win on memory bandwidth not just FLOPs. batching ops keeps VRAM fed at 2TB/s instead of tripping to RAM at 50GB/s for every operation
> in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS
wild
Why are we comparing a programing language and a GPU. This is a category error. Programing languages do not do any operations. They perform no FLOPs, they are the thing the FLOPs are performing.
"The I7-4770K and preform 20k more Flops than C++" is an equally sensible statement (i.e. not)
> Why are we comparing a programing language and a GPU.
You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.
When the author says it's millions of flops faster in a gpu than in an interpreteted programming language, it's not comparing them directly, but algorithms that run in them, so the substitution is the algorithms for the tools used to implement/run them.
It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of flops slower then on the GPU". There is no category error there.
the sentence is ambiguous because "Python" can mean python + a certain library and even a different Python implementation
but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup
in this case NVIDIA A100 vs "Python" that does not reach a A100 (without the help of CUDA and PyTorch)
another analogy:
I find it useful to be able to compare what the fastest known way is to move a container from A to B using a certain vehicle (e.g. truck) and how that compares to how fast a person that can not drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)
I'm also interested in how much energy is needed, how much the hw costs and so on
Often there are many ways to do things, comparing is a great starting point for learning more
1 reply →
> This is a category error.
Okay, but surely you know what they actually mean right, or are you being willfully obtuse? They are comparing CPython (the main python implementation)'s implementation that runs on the CPU with a kernel running on the GPU.
1 reply →
This statement makes zero sense
re comments:
yes of course this is apples to oranges but that's kind of the point
it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU
the interesting thing is why that is so
CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.
AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
11 replies →
Which, lets be honest, is probably still being orchestrated by Python somewhere.
Python is 9.75 million times faster than Python.
I was researching if there was much benefit to using Rust or C++ over Python for AI, and turns out, the GPU doesn't care once the instructions are in because its an entirely different spec running on the GPU. The only thing you might save on is "startup" costs of getting your code into the GPU I guess? I assume that time cost is miniscule though, once its all in memory, nobody cares that you spent any time "booting it up" any more than how long Windows takes these days.
3 replies →
Single core vs multi core accounts for much of this
Not really. GPU many cores, at least for fp32, gives you 2 to 4 order of magnitudes compared to high speed CPU.
The rest will be from "python float" (e.g. not from numpy) to C, which gives you already 2 to 3 order of magnitude difference, and then another 2 to 3 from plan C to optimized SIMD.
See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 order of magnitude just from C.
2 replies →
I'd want to see more about the failure modes. Production systems need graceful degradation more than optimal performance.
I feel like there is no portable advice for performance. A torch model exported as onnx is a different model.
That onnx model run using onnxruntime with cuda ep is a different model than the one run with TRT ep.
And even among the same runtime, depending on the target hardware and the memory available during tuning, the model behaves differently. It is a humongous mess
That's interesting as I was considering GGUF --> ONNX conversions (via Olive), but if this creates unknown distortions in the effectiveness and stability, it might be a dead-end idea.
How does x.cos().cos() work faster than doing two cos calls separately? Like the first cos call returns a tensor either way, the only difference is that it's not assigned to a variable. But how is it even possible know that difference in python?
The author forgot to add "fused" here, like they did in other parts of the same section.
Non-fused:
Fused, no intermediate variable:
The temporary "t" doesn't leave the GPU. Sweeping the array twice makes you twice as dependent on memory bandwidth.
It’s really not a concept you can express in idiomatic Python very easily. This comes from the actual generated assembly involving copies from global GPU memory into registers (slow, bandwidth saturates quickly) and back in between the cosines. If you can avoid the intermediate roundtrip that cuts the cost approximately in half.
Yeah, that part should not be read literally; `x.cos().cos()` and `x1 = x.cos(); x2 = x1.cos()` both launch the same number of kernels (two in unfused/eager mode, one in fused/torch.compile, see this test notebook [1]). I think the author chained the two cos calls to symbolize the idea of combining them (without exposing the intermediate result), but chaining the two cos calls doesn't literally trigger operator fusion.
[1] https://colab.research.google.com/drive/13a4Y-ko6QLMPAhBz64c...
Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch
If you aren't already aware, Karpathy has several videos that could get you there in a few hours https://www.youtube.com/@AndrejKarpathy
very thanks!
1 reply →
If you want a written resource I have a blog post about the mathematics behind building a feed forward from scratch, https://max-amb.github.io/blog/the_maths_behind_the_mlp/. Kinda focuses on translation from individual components to matrix operations.
It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention then maybe a small inference engine. Kaparthy’s llama2.c is only ~300 lines on the latter and it pragma simds so you don’t need fancy GPUs
Needs 2022 in title
Deep learning is just glorified linear algebra. Master the progression: Feed-forward CNN RNN LSTM Attention. You don't even need a GPU to understand the climax; Karpathy’s llama2.c implements a full transformer inference engine in just ~300 lines of C using SIMD pragmas for CPU execution.
I wish more people pursued that approach to teaching neural networks.
First teach what the network does and why, writing it as a loopy, inference-only Python function. Explain training only in an abstract way, E.G. with the "take a random weight, twist it a little and see if the loss improves" algorithm. This lets you focus on the architecture and on why it is what it is.
Then, teach the intuitions behind derivatives and gradient descent. You don't need the entirety of calculus, there's no benefit to knowing how a sequence or limit works if you ) only want to understand neural networks. With autograd, you won't be manually doing derivatives of weird functions either, so intuitive understanding is a lot more important than doing dozens of traditional calculus exercises on paper like it's the 1800s. You could probably explain the little bit of calculus you need in an hour or two, even to somebody with a 12-year-old's understanding of math and a good bit of programming knowledge.
Only when people understand the training and inference, implemented with loops and descriptive variable names, teach the tensor, explain how a modern CPU and GPU works (because many programmers still think a modern computer is just a much faster 6502), and then teach the tricks we use to make it fast.
I just assume that people who are going to do useful things in ML have basic foundation in math and science. If you don’t know what a derivative is what are we doing talking about multi-variable optimization.
And it’s not about gate-keeping it’s really about being able to reason about these concepts. What this looks like in programming is people memorizing a million clean code rules and not being able to write binary search.
6 replies →
So you created a new account to blatantly plagiarize another comment from this same page? What's even going on here?
[flagged]
>For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.
https://arxiv.org/abs/1912.02292
Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment, a few words to indicate what you’re actually trying to say would go a long way.
>We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
5 replies →