A lot of people here are commenting that if you have to care about specific latency numbers in Python you should just use another language.
I disagree. A lot of important and large codebases were grown and maintained in Python (Instagram, Dropbox, OpenAI) and it's damn useful to know how to reason your way out of a Python performance problem when you inevitably hit one without dropping out into another language, which is going to be far more complex.
Python is a very useful tool, and knowing these numbers just makes you better at using the tool.
The author is a Python Software Foundation Fellow. They're great at using the tool.
In the common case, a performance problem in Python is not the result of hitting the limit of the language but the result of sloppy un-performant code, for example unnecessarily calling a function O(10_000) times in a hot loop.
I do performance optimization for a system written in Python. Most of these numbers are useless to me, because they’re completely irrelevant until they become a problem, then I measure them myself. If you are writing your code trying to save on method calls, you’re not getting any benefit from using the language and probably should pick something else.
Good designs do not happen in a vacuum but are informed by knowledge of at least the outlines of the environment.
One approach is to have breakfast while pursuing an idea -- let me spill some sticky milk on the dining table; who cares, I will clean up if it becomes a problem later.
The other is that it's not much of an overbearing constraint to avoid making a mess with spilt milk in the first place -- maybe it would not be a big bother later, but it costs me little now to not be sloppy, so let me be a little hygienic.
There's a balance between making a mess and cleaning up and not making a mess in the first place. The other extreme is to be so defensive about the possibility of creating a mess that it paralyses progress.
The sweet spot is somewhere between the extremes and having the ball-park numbers in the back of one's mind helps with that. It informs about the environment.
Python’s issue is that it is incredibly slow in use cases that surprise average developers. It is incredibly slow at very basic stuff, like calling a function or accessing a dictionary.
If Python didn’t have such an enormous number of popular C and C++ based libraries it would not be here. It was saved by NumPy and the like.
I'm not sure how Python can be described as "saved" by numpy et al., when the numerical Python ecosystem was there near the beginning, and the language and ecosystem have co-evolved? Why didn't Perl (with PDL), R or Ruby (or even php) succeed in the same way?
i hate python but if your bottleneck is that sqlite query, optimizing a handful of addition operations is a wash. That's why you need to at least have a feel for these tables.
I think these kind of numbers are everywhere and not just specific to Python.
In Zig, I sometimes take a brief look at the CPU cycle counts of various operations to reduce cache misses, while I need to be aware of the alignment and size of a data type to debloat a data structure. If their logic applied, too bad, I should quit programming, since all languages have their own latency for certain operations we should be aware of.
There are reasons to not use Python, but that particular reason is not the one.
For some of these, there are alternative modules you can use, so it is important to know this. But if it really matters, I would think you'd know this already?
For me, it will help with selecting what language is best for a task. I think it won't change my view that python is an excellent language to prototype in though.
I think both points are fair. Python is slow - you should avoid it if speed is critical, but sometimes you can’t easily avoid it.
I think the list itself is super long winded and not very informative. A lot of operations take about the same amount of time. Does it matter that adding two ints is very slightly slower than adding two floats? (If you even believe this is true, which I don’t.) No. A better summary would say “all of these things take about the same amount of time: simple math, function calls, etc. these things are much slower: IO.” And in that form the summary is pretty obvious.
I think the list itself is super long winded and not very informative.
I agree. I have to compliment the author for the effort put in. However, it misses the point of the original “Latency Numbers Every Programmer Should Know”, which is to build an intuition for making good ballpark estimations of the latency of operations, e.g. that A is two orders of magnitude more expensive than B.
Small startups end up writing code in whatever gets things working faster, because having too large a codebase with too much load is a champagne problem.
If I told you that we were going to be running a very large payments system, with customers from startups to Amazon, you'd not write it in Ruby, put the data in MongoDB, and then use its oplog as a queue... but that's what Stripe looked like. They even hired a compiler team to add type checking to the language, as that made far more sense than porting a giant monorepo to something else.
Python has types, and now even gradual static typing if you want to go further. It's irrelevant whether the language is an interpreted scripting language if it solves your problem.
It’s very natural. Python is fantastic for going from 0 to 1 because it’s easy and forgiving. So lots of projects start with it. Especially anything ML focused. And it’s much harder to change tools once a project is underway.
Someone says "let's write a prototype in Python" and someone else says "are you sure we shouldn't use a a better language that is just as productive but isn't going to lock us into abysmal performance down the line?" but everyone else says "nah we don't need to worry about performance yet, and anyway it's just a prototype - we'll write a proper version when we need to"...
10 years later "ok it's too slow; our options are a) spend $10m more on servers, b) spend $5m writing a faster Python runtime before giving up later because nobody uses it, c) spend 2 years rewriting it and probably failing, during which time we can make no new features. a) it is then."
Or keep your Python scaffolding, but push the performance-critical bits down into a C or Rust extension, like numpy, pandas, PyTorch and the rest all do.
But I agree with the spirit of what you wrote - these numbers are interesting but aren’t worth memorizing. Instead, instrument your code in production to see where it’s slow in the real world with real user data (premature optimization is the root of all evil etc), profile your code (with pyspy, it’s the best tool for this if you’re looking for cpu-hogging code), and if you find yourself worrying about how long it takes to add something to a list in Python you really shouldn’t be doing that operation in Python at all.
I agree. I've been living off Python for 20 years and have never needed to know any of these numbers, nor do I need them now, for my work, contrary to the title. I also regularly use profiling for performance optimization and opt for Cython, SWIG, JIT libraries, or other tools as needed. None of these numbers would ever factor into my decision-making.
Why? I've built some massive analytic data flows in Python with turbodbc + pandas which are basically C++ fast. It uses more memory, which supports your point, but on the flip side we're talking $5-10 extra cost a year. It could frankly be $20k a year and still be cheaper than staffing more people like me to maintain these things, rather than having a couple of us and then letting the BI people use the tools we provide for them. Similarly, when we do embedded work, MicroPython is just so much easier to deal with for our engineering staff.
The interoperability between C and Python makes it great, and you need to know these numbers on Python to know when to actually build something in C. With Zig getting really great interoperability, things are looking better than ever.
Not that you're wrong as such. I wouldn't use Python to run an airplane, but I really don't see why you wouldn't care about the resources just because you're working with an interpreted or GC language.
> you need to know these numbers on Python to know when to actually build something in C
People usually approach this the other way, use something like pandas or numpy from the beginning if it solves your problem. Do not write matrix multiplications or joins in python at all.
If there is no library that solves your problem, it's a great indication that you should avoid python. Unless you are willing to spend 5 man-years writing a C or C++ library with good python interop.
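To make that concrete, a minimal sketch comparing a pure-Python matrix multiply with NumPy (this assumes NumPy is installed; the 200x200 size and the timings are arbitrary, and the exact speedup depends on hardware and the BLAS build):

# Same 200x200 matrix multiply, pure-Python loops vs. NumPy.
import time
import numpy as np

n = 200
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float(j * n + i) for j in range(n)] for i in range(n)]

t0 = time.perf_counter()
result = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
pure_python = time.perf_counter() - t0

a_np, b_np = np.array(a), np.array(b)
t0 = time.perf_counter()
result_np = a_np @ b_np        # the whole multiplication happens in compiled code
numpy_time = time.perf_counter() - t0

print(f"pure Python: {pure_python:.3f}s, NumPy: {numpy_time:.5f}s")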
From the complete opposite side, I've built some tiny bits of near irrelevant code where python has been unacceptable, e.g. in shell startup / in bash's PROMPT_COMMAND, etc. It ends up having a very painfully obvious startup time, even if the code is nearing the equivalent of Hello World
time python -I -c 'print("Hello World")'
real 0m0.014s
time bash --noprofile -c 'echo "Hello World"'
real 0m0.001s
These basically seem like numbers of last resort. After you’ve profiled and ruled out all of the usual culprits (big disk reads, network latency, polynomial or exponential time algorithms, wasteful overbuilt data structures, etc) and need to optimize at the level of individual operations.
I doubt there is much to gain from knowing how much memory an empty string takes. The article, or the listed numbers, have a weird fixation on memory usage numbers and concrete time measurements. What is way more important to "every programmer" is time and space complexity, in order to avoid designing unnecessarily slow or memory-hungry programs. Under the assumption of using Python, what is the use of knowing that your int takes 28 bytes? In the end you will have to determine whether the program you wrote meets the performance criteria you have, and if it does not, then you need a smarter algorithm or way of dealing with data. It helps very little to know that your 2d array of 1000x1000 bools is so and so big. What helps is knowing whether it is too much, and that maybe you should switch to using a large integer and a bitboard approach. Or switch language.
I disagree. Performance is a leaky abstraction that *ALWAYS* matters.
Your cognition of it is either implicit or explicit.
Even if you didn't know, for example, that list appends are (amortized) linear rather than quadratic, and fairly fast.
Even if you didn't give a shit that simple programs were for some reason 10000x slower than they needed to be, because it meets some baseline level of good enough and/or you aren't the one impacted by the problems the inefficiency creates.
Library authors beneath you would still know, and the APIs you interact with, the Pythonic code you see, and the code LLMs generate will all be affected by that leaky abstraction.
If you think that O(n^2) naive list appends are a bad example, they're not, btw: naive Python string appends are O(n^2), and that has affected and still affects how people do things; f-strings, for example, are lazy.
Similarly, a direct consequence of dictionaries being fast in Python is that they are used literally everywhere. Raymond Hettinger's old PyCon 2017 talks cover this.
Ultimately, what the author of the blog has provided is a numerical justification for the implicit, tacit knowledge that an understanding of performance gives you.
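A quick sketch of the string-append point (note that CPython sometimes optimizes += on a string with no other references, so the sketch keeps an extra reference to show the quadratic case; N is an arbitrary choice):

# Building a big string with repeated concatenation copies the whole string each
# time (quadratic overall), while ''.join builds it once (linear).
import time

N = 50_000

t0 = time.perf_counter()
s = ""
for i in range(N):
    prev = s          # second reference defeats CPython's in-place += optimization
    s = s + "x"       # so this really does copy the string built so far
concat_time = time.perf_counter() - t0

t0 = time.perf_counter()
s2 = "".join("x" for _ in range(N))
join_time = time.perf_counter() - t0

print(f"repeated +: {concat_time:.3f}s, ''.join: {join_time:.4f}s")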
> Under the assumption of using Python, what is the use of knowing that your int takes 28 bytes?
Relevant if your problem demands instantiation of a large number of objects. This reminds me of a post where Eric Raymond discusses the problems he faced while trying to use Reposurgeon to migrate GCC. See http://esr.ibiblio.org/?p=8161
A meta-note on the title since it looks like it’s confusing a lot of commenters: The title is a play on Jeff Dean’s famous “Latency Numbers Every Programmer Should Know” from 2012. It isn’t meant to be interpreted literally. There’s a common theme in CS papers and writing to write titles that play upon themes from past papers. Another common example is the “_____ considered harmful” titles.
Good callout on the paper reference, but this author gives every indication that he’s dead serious in the first paragraph. I don’t think commenters are confused.
The title wasn't meant to be taken literally, as in you're supposed to memorize all of these numbers. It was meant as an in-joke reference to the original writing, to signal that this document was going to contain timing values for different operations.
I completely understand why it's frustrating or confusing by itself, though.
From what I've been able to glean, it was basically created in the first few years Jeff worked at Google, on indexing and serving for the original search engine. For example, the comparison of cache, RAM, and disk determined whether data was stored in RAM (the index, used for retrieval) or on disk (the documents, typically not used in retrieval, but used in scoring). Similarly, the California-to-Netherlands comparison: I believe Google's first international data center was in NL, and they needed to make decisions about copying over the entire index in bulk versus serving backend queries in the US with frontends in NL.
The numbers were always going out of date; for example, the arrival of flash drives changed disk latency significantly. I remember Jeff came to me one day and said he'd invented a compression algorithm for genomic data "so it can be served from flash" (he thought it would be wasteful to use precious flash space on uncompressed genomic data).
Every Python programmer should be thinking about far more important things than low level performance minutiae. Great reference but practically irrelevant except in rare cases where optimization is warranted. If your workload grows to the point where this stuff actually matters, great! Until then it’s a distraction.
Having general knowledge about the tools you're working with is not a distraction, it's an intellectual enrichment in any case, and can be a valuable asset in specific cases.
I am currently (as we type actually LOL) doing this exact thing in a hobby GIS project: Python got me a prototype and proof of concept, but now that I am scaling the data processing to worldwide, it is obviously too slow so I'm rewriting it (with LLM assistance) in C. The huge benefit of Python is that I have a known working (but slow) "reference implementation" to test against. So I know the C version works when it produces identical output. If I had a known-good Python version of past C, C++, Rust, etc. projects I worked on, it would have been most beneficial when it came time to test and verify.
Sometimes it’s as simple as finding the hotspot with a profiler and making a simple change to an algorithm or data structure, just like you would do in any language. The amount of handwringing people do about building systems with Python is silly.
I agree - however, that has mostly been a feeling for me for years. Things feel fast enough and fine.
This page is a nice reminder of the fact, with numbers. For a while, at least, I will Know, instead of just feel, like I can ignore the low level performance minutiae.
> Collection Access and Iteration
> How fast can you get data out of Python’s built-in collections? Here is a dramatic example of how much faster the correct data structure is. item in set or item in dict is 200x faster than item in list for just 1,000 items!
It seems to suggest an iteration for x in mylist is 200x slower than for x in myset. It’s the membership test that is much slower. Not the iteration. (Also for x in mydict is an iteration over keys not values, and so isn’t what we think of as an iteration on a dict’s ‘data’).
Also the overall title “Python Numbers Every Programmer Should Know” starts with 20 numbers that are merely interesting.
That all said, the formatting is nice and engaging.
I liked reading through it from a "is modern Python doing anything obviously wrong?" perspective, but strongly disagree anyone should "know" these numbers. There's like 5-10 primitives in there that everyone should know rough timings for; the rest should be derived with big-O algorithm and data structure knowledge.
It’s missing the time taken to instantiate a class.
I remember refactoring some code to improve readability, then observing something that previously took a few microseconds taking tens of seconds.
The original code created a large list of lists. Each child list had 4 fields; each field was a different thing: some were ints and one was a string.
I created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting. Modern Python developers would use a data class for this.
The new code was very slow. I’d love it if the author measured the time taken to instantiate a class.
Instantiating classes is in general not a performance issue in Python. Your issue here strongly sounds like you're abusing OO to pass a list of instances into every method and downstream call (not just the usual reference to self, the instance at hand). Don't do that, it shouldn't be necessary. It sounds like you're trying to get a poor-man's imitation of classmethods, without identifying and refactoring whatever it is that methods might need to access from other instances.
Please post your code snippet on StackOverflow ([python] tag) or CodeReview.SE so people can help you fix it.
> created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting.
I went to the doctor and I said “It hurts when I do this”
The doctor said, “don’t do that”.
Edit: so yeah, a rather snarky reply. Sorry. But it’s worth asking why we want to use classes and objects everywhere. Alan Kay is well known for saying object orientation is about message passing (a point mostly repeated by Erlang people).
A list of lists (where each list is four different types repeated) seems a fine data structure, which can be operated on by external functions, and serialised pretty easily. Turning it into classes and objects might not be a useful refactoring, I would certainly want to learn more before giving the go ahead.
The main reason why is to keep a handle on complexity.
When you’re in a project with a few million lines of code and 10 years of history it can get confusing.
Your data will have been handled by many different functions before it gets to you. If you do this with raw lists then the code gets very confusing. In one data structure the customer name might be at [4] and another structure might have it at [9]. Worse, someone adds a new field at [5], and then, when two lists get concatenated, the name moves to [10] in downstream code that consumes the concatenated lists.
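For illustration, a minimal sketch of the data-class version mentioned above (the field names are invented; slots=True needs Python 3.10+ and keeps per-instance overhead down compared to a plain class with __dict__):

from dataclasses import dataclass

@dataclass(slots=True)
class Record:
    customer_name: str
    quantity: int
    unit_price: int
    discount: int

rows = [Record("Alice", 3, 999, 0), Record("Bob", 1, 2500, 10)]
print(rows[0].customer_name)   # readable, unlike rows[0][0]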
That's a long list of numbers that seem oddly specific. Apart from learning that f-strings are way faster than the alternatives, and certain other comparisons, I'm not sure what I would use this for day-to-day.
After skimming over all of them, it seems like most "simple" operations take on the order of 20ns. I will leave with that rule of thumb in mind.
Thanks for that bit of info! I was surprised by the speed difference. I have always assumed that most variations of basic string formatting would compile to the same bytecode.
I usually prefer classic %-formatting for readability when the arguments are longer and f-strings when the arguments are shorter. Knowing there is a material performance difference at scale might shift the balance in favour of f-strings in some situations.
That number isn't very useful either, it really depends on the hardware. Most virtualized server CPUs where e.g. Django will run on in the end are nowhere near the author's M4 Pro.
Last time I benchmarked a VPS it was about the performance of an Ivy Bridge generation laptop.
> Last time I benchmarked a VPS it was about the performance of an Ivy Bridge generation laptop.
I have a number of Intel N95 systems around the house for various things. I've found them to be a pretty accurate analog for small VPS instances. The N95s are Intel E-cores, which are effectively Sandy Bridge/Ivy Bridge cores.
Stuff can fly on my MacBook but then drag on a small VPS instance, so validating against an N95 (which I already have) is helpful. YMMV.
I think we can safely steelman the claim to "every Python programmer should know", and even from there, every "serious" Python programmer, writing Python professionally for some "important" reason, not just everyone who picks up Python for some scripting task. Obviously there's not much reason for a C# programmer to go try to memorize all these numbers.
Though IMHO it suffices just to know that "Python is 40-50x slower than C and is bad at using multiple CPUs" is not just some sort of anti-Python propaganda from haters, but a fairly reasonable engineering estimate. If you know that you don't really need that chart. If your task can tolerate that sort of performance, you're fine; if not, figure out early how you are going to solve that problem, be it through the several ways of binding faster code to Python, using PyPy, or by not using Python in the first place, whatever is appropriate for your use case.
This is a really weird thing to worry about in Python. But it is also misleading: Python ints are arbitrary precision, so they can take up much more storage and arithmetic time depending on their value.
You absolutely do not need to know those absolute numbers--only the relative costs of various operations.
Additionally, regardless of the code you can profile the system to determine where the "hot spots" are and refactor or call-out to more performant (Rust, Go, C) run-times for those workflows where necessary.
I'm surprised that the `isinstance()` comparison is with `type() == type` and not `type() is type`, which I would expect to be faster, since the `==` implementation tends to have an `isinstance` call anyway.
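A quick way to check that comparison yourself (absolute numbers will differ by machine; only the relative ordering is interesting, and the lambda call overhead is included in every row):

import timeit

x = 42
print("isinstance:", timeit.timeit(lambda: isinstance(x, int)))
print("type() is :", timeit.timeit(lambda: type(x) is int))
print("type() == :", timeit.timeit(lambda: type(x) == int))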
The one I noticed the most was import openai and import numpy.
They're both about a full second on my old laptop.
I ended up writing my own simple LLM library just so I wouldn't have to import OpenAI anymore for my interactive scripts.
(It's just some wrapper functions around the equivalent of a curl request, which is honestly basically everything I used the OpenAI library for anyway.)
I have noticed how long it takes to import numpy. It made rerunning a script noticeably sluggish. Not sure what openai's excuse is, but I assume numpy's slowness is loading some native DLLs?
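One way to see this on your own machine (a small sketch that assumes numpy is installed; CPython's -X importtime flag gives a per-module breakdown if you want more detail):

# Time a single import; results vary a lot with hardware and warm OS caches.
import time

t0 = time.perf_counter()
import numpy  # assumes numpy is installed
print(f"import numpy took {time.perf_counter() - t0:.3f}s")
# For a per-module breakdown, run:  python -X importtime -c "import numpy"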
Interesting information but these are not hard numbers.
Surely the 100-char string information of 141 bytes is not correct as it would only apply to ASCII 100-char strings.
It would be more useful to know the overhead for Unicode strings, presumably UTF-8 encoded. And again I would presume a 100-emoji string would take 441 bytes (just a hypothesis) and a 100-umlaut-character string would take 241 bytes.
There are lots of discussions about the relevance of these numbers to a regular software engineer.
Firstly, I want to start with the fact that the base system is macOS on an M4 Pro, hence:
- Memory related access is possibly much faster than a x86 server.
- Disk access is possibly much slower than a x86 server.
*) I took an x86 server as the basis, as most applications run on x86 Linux boxes nowadays, although a good amount of the footprint is also on other ARM CPUs.
Although it probably does not change the memory footprint much, the libraries loaded and their architecture (i.e. Rosetta or not) will change the overall footprint of the process.
As mentioned in one of the sibling comments -> always inspect/trace your own workflow/performance before making assumptions. It all depends on specific use-cases for higher-level performance optimizations.
I doubt list and string concatenation operate in constant time, or else they affect another benchmark. E.g., you can concatenate two lists in the same time, regardless of their size, but at the cost of slower access to the second one (or both).
More contentiously: don't fret too much over performance in Python. It's a slow language (except for some external libraries, but that's not the point of the OP).
String concatenation is mentioned twice on that page, with the same time given. The first time it has a parenthetical "(small)", the second time doesn't have it. I expect you were looking at the second one when you typed that as I would agree that you can't just label it as a constant time, but they do seem to have meant concatenating "small" strings, where the overhead of Python's object construction would dominate the cost of the construction of the combined string.
Great catalogue. On the topic of msgspec, since pydantic is included it may be worth including a bench for de-serializing and serializing from a msgspec struct.
That appears to be the size of the list itself, not including the objects it contains: 8 bytes per entry for the object pointer, and a kilo-to-kibi conversion. All Python values are "boxed", which is probably a more important thing for a Python programmer to know than most of these numbers.
The list of floats is larger, despite also being simply an array of 1000 8-byte pointers. I assume that it's because the int array is constructed from a range(), which has a __len__(), and therefore the list is allocated to exactly the required size; but the float array is constructed from a generator expression and is presumably dynamically grown as the generator runs and has a bit of free space at the end.
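You can check that explanation on your own interpreter with sys.getsizeof (note it reports only the list object itself, not the 1,000 boxed values it points to):

import sys

ints = list(range(1000))                      # length known up front, sized exactly
floats = list(float(x) for x in range(1000))  # grown incrementally, keeps spare capacity

print(sys.getsizeof(ints), sys.getsizeof(floats))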
It's impressive how you figured out the reason for the difference in size between the list of floats and the list of ints. Framed as an interview question, that would have been quite difficult, I think.
It's important to know that these numbers will vary based on what you're measuring, your hardware architecture, and how your particular Python binary was built.
For example, my M4 Max running Python 3.14.2 from Homebrew (built, not poured) takes 19.73MB of RAM to launch the REPL (running `python3` at a prompt).
The same Python version launched on the same system with a single invocation for `time.sleep()`[1] takes 11.70MB.
My Intel Mac running Python 3.14.2 from Homebrew (poured) takes 37.22MB of RAM to launch the REPL and 9.48MB for `time.sleep`.
My number for "how much memory it's using" comes from running `ps auxw | grep python`, taking the value of the resident set size (RSS column), and dividing by 1,024.
1: python3 -c 'from time import sleep; sleep(100)'
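For a scriptable alternative to eyeballing ps output, the stdlib resource module can report the process's own peak RSS (a sketch; resource is Unix-only, and to my knowledge Linux reports kilobytes while macOS reports bytes, so the units are normalized below):

import resource
import sys

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
if sys.platform == "darwin":
    mib = peak / (1024 * 1024)   # macOS: ru_maxrss is in bytes
else:
    mib = peak / 1024            # Linux: ru_maxrss is in kilobytes
print(f"peak RSS: {mib:.2f} MiB")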
Thanks for the feedback everyone. I appreciate your posting it @woodenchair and @aurornis for pointing out the intent of the article.
The idea of the article is NOT to suggest you should shave 0.5ns off by choosing some dramatically different algorithm or that you really need to optimize the heck out of everything.
In fact, I think a lot of what the numbers show is that over-thinking the optimizations often isn't worth it (e.g. caching len(coll) into a variable rather than calling it over and over is less useful than it might seem conceptually).
Just write clean Python code. So much of it is way faster than you might have thought.
My goal was only to create a reference to what various operations cost to have a mental model.
I didn't tell anyone to optimize anything. I just posted numbers. It's not my fault some people are wired that way. Anytime I suggested some sort of recommendation it was to NOT optimize.
For example, from the post "Maybe we don’t have to optimize it out of the test condition on a while loop looping 100 times after all."
> String operations in Python are fast as well. f-strings are the fastest formatting style, while even the slowest style is still measured in just nanoseconds.
> Concatenation (+) 39.1 ns (25.6M ops/sec)
> f-string 64.9 ns (15.4M ops/sec)
It says f-strings are fastest but the numbers show concatenation taking less time? I thought it might be a typo but the bars on the graph reflect this too?
String concatenation isn't usually considered a "formatting style", that refers to the other three rows of the table which use a template string and have specialized syntax inside it to format the values.
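For anyone who wants to re-run the comparison locally, a rough sketch (the lambda call overhead is included in every row, so read the results as relative rather than absolute; which row wins can shift between Python versions and machines):

import timeit

name, value = "answer", 42
print("concat  :", timeit.timeit(lambda: name + "=" + str(value)))
print("f-string:", timeit.timeit(lambda: f"{name}={value}"))
print("format  :", timeit.timeit(lambda: "{}={}".format(name, value)))
print("percent :", timeit.timeit(lambda: "%s=%d" % (name, value)))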
This is helpful. Someone should create a similar benchmark for the BEAM. This is also a good reminder to continue working on snakepit [1] and snakebridge [2]. Plenty remains before they're suitable for prime time.
As someone who most often works in a language that is literally orders of magnitude slower than this -- and has done so since CPU speeds were measured in double-digit megahertz -- I am crying at the notion that anything here is measured in nanoseconds.
It is open source, you could just look. :) But here is a summary for you. It's not just one run and take the number:
Benchmark Iteration Process
Core Approach:
- Warmup Phase: 100 iterations to prepare the operation (default)
- Timing Runs: 5 repeated runs (default), each executing the operation a specified number of times
- Result: Median time per operation across the 5 runs
Iteration Counts by Operation Speed:
- Very fast ops (arithmetic): 100,000 iterations per run
- Fast ops (dict/list access): 10,000 iterations per run
- Medium ops (list membership): 1,000 iterations per run
- Slower ops (database, file I/O): 1,000-5,000 iterations per run
Quality Controls:
- Garbage collection is disabled during timing to prevent interference
- Warmup runs prevent cold-start bias
- Median of 5 runs reduces noise from outliers
- Results are captured to prevent compiler optimization elimination
Total Executions: For a typical benchmark with 1,000 iterations and 5 repeats, each operation runs 5,100 times (100 warmup + 5×1,000 timed) before reporting the median result.
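For reference, a minimal sketch of that kind of harness (this is not the project's actual code, just the shape of the process described above):

import gc
import statistics
import time

def bench(op, iterations=100_000, repeats=5, warmup=100):
    for _ in range(warmup):          # warmup to avoid cold-start bias
        op()
    gc.disable()                     # keep the collector from interfering
    try:
        runs = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            for _ in range(iterations):
                op()
            runs.append((time.perf_counter() - t0) / iterations)
    finally:
        gc.enable()
    return statistics.median(runs)   # median of the per-run averages

print(f"{bench(lambda: 1 + 1) * 1e9:.1f} ns per op (includes lambda call overhead)")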
That answers what N is (why not just say so in the article?). If you are only going to report medians, is there an appendix with further statistics such as confidence intervals or standard deviations? For a serious benchmark, it would be essential to show the spread or variability, no?
I think a lot of commenters here are missing the point.
Looking at performance numbers is important regardless of whether it's Python, assembly, or HDL. If you don't understand why your code is slow, you can always look at how many cycles things take and learn to understand how code works at a deeper level. As you mature as a programmer these things become obvious, but going through the learning process and having references like these will help you get there sooner. Seeing the performance numbers and asking why some things take much longer (or sometimes why they take the exact same time) is the perfect opportunity to learn.
Early in my Python career I had a script that found duplicate files across my disks. The first iteration of the script was extremely slow, and optimizing it went through several iterations as I learned how to optimize at various levels. None of them required me to use C. I just used caching, learned to enumerate all files on disk quickly, and used sets instead of lists. The end result was that subsequent runs of my script took 10 seconds instead of 15 minutes. Maybe implementing it in C would have made it run in 1 second, but if I had just assumed my script was slow because of Python, I would've spent hours doing it in C only to go from 15 minutes to 14 minutes and 51 seconds.
There's an argument to be made that it would be useful to see C numbers next to the python ones, but for the same reason people don't just tell you to just use an FPGA instead of using C, it's also rude to say python is the wrong tool when often it isn't.
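For illustration, a rough sketch of that dicts-and-sets approach to duplicate finding (paths, hashing choice, and error handling are all simplified, and whole files are read into memory for brevity):

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    # Group by size first (cheap), then hash only the candidates.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue                      # unique size -> cannot be a duplicate
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash[(size, digest)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]

print(find_duplicates("."))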
Initially I thought how efficient strings are... but then I understood how inefficient arithmetic is.
Interesting comparison, but exact speed and IO depend on a lot of things, and it's unlikely one uses a Mac mini in production, so these numbers definitely aren't representative.
That's an "all or nothing" fallacy. Just because you use Python and are OK with some slowdown, doesn't mean you're OK with each and every slowdown when you can do better.
To use a trivial example, using a set instead of a list to check membership is a very basic replacement, and can dramatically improve your running time in Python. Just because you use Python doesn't mean anything goes regarding performance.
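To make that trivial example concrete, a quick sketch (the exact ratio depends on where the item sits in the list, or whether it's present at all):

import timeit

items_list = list(range(1_000))
items_set = set(items_list)

print("in list:", timeit.timeit(lambda: 999 in items_list, number=10_000))
print("in set :", timeit.timeit(lambda: 999 in items_set, number=10_000))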
Great reference overall, but some of these will diverge in practice: 141 bytes for a 100 char string won’t hold for non-ASCII strings for example, and will change if/when the object header overhead changes.
int is larger than float, but list of floats is larger than list of ints
Then again, if you're worried about any of the numbers in this article maybe you shouldn't be using Python at all. I joke, but please do at least use Numba or Numpy so you aren't paying huge overheads for making an object of every little datum.
It is always a good idea to have at least a rough understanding of how much operations in your code cost, but sometimes very expensive mistakes end up in non-obvious places.
If I have only plain Python installed and a .py file that I want to test, then what's the easiest way to get a visualization of the call tree (or something similar) and the computational cost of each item?
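One stdlib-only option, sketched below: cProfile plus pstats gives a sortable table and caller/callee breakdowns rather than a graphical call tree; the work() function here is just a stand-in for your own code:

# From the command line:  python -m cProfile -s cumulative your_script.py
# Or, from code:
import cProfile
import pstats

def work():
    return sum(i * i for i in range(100_000))

cProfile.run("work()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)   # top 10 entries by cumulative time
stats.print_callees("work")                      # what 'work' spends its time calling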
I have some questions, requests for clarification, and suspicious behavior I noticed after reviewing the results and the benchmark code, specifically:
- If slotted attribute reads and regular attribute reads are the same latency, I suspect that either the regular class may not have enough "bells on" (inheritance/metaprogramming/dunder overriding/etc) to defeat simple optimizations that cache away attribute access, thus making it equivalent in speed to slotted classes. I know that over time slotting will become less of a performance boost, but--and this is just my intuition and I may well be wrong--I don't get the impression that we're there yet.
- Similarly "read from @property" seems suspiciously fast to me. Even with descriptor-protocol awareness in the class lookup cache, the overhead of calling a method seems surprisingly similar to the overhead of accessing a field. That might be explained away by the fact that property descriptors' "get" methods are guaranteed to be the simplest and easiest to optimize of all call forms (bound method, guaranteed to never be any parameters), and so the overhead of setting up the stack/frame/args may be substantially minimized...but that would only be true if the property's method body was "return 1" or something very fast. The properties tested for these benchmarks, though, are looking up other fields on the class, so I'd expect them to be a lot slower than field access, not just a little slower (https://news.ycombinator.com/item?id=46056895) and not representative. To benchmark "time it takes for the event loop to spin once and produce a result"/the python equivalent of process.nextTick, it'd be better to use low-level loop methods like "call_soon" or defer completion to a Task and await that.
tfa mentions running the benchmark on a multi-core platform, but doesn't mention if the benchmark results used multithreading... a brief look at the code suggests not.
Sad that your comment is downvoted. But yes, for those who need clarification:
1) Measurements are faulty. A list of 1,000 ints can be 4x smaller. Most time measurements depend on circumstances that are not mentioned, and therefore can't be reproduced.
2) Brainrot AI style. A hashmap is not "200x faster than a list!"; that's not how complexity works.
3) orjson/ujson are faulty, which is one of the reasons they don't replace the stdlib implementation. Expect crashes, broken JSON, anything from them.
4) What actually will be used in number-crunching applications - numpy or similar libraries - is not even mentioned.
A lot of people here are commenting that if you have to care about specific latency numbers in Python you should just use another language.
I disagree. A lot of important and large codebases were grown and maintained in Python (Instagram, Dropbox, OpenAI) and it's damn useful to know how to reason your way out of a Python performance problem when you inevitably hit one without dropping out into another language, which is going to be far more complex.
Python is a very useful tool, and knowing these numbers just makes you better at using the tool. The author is a Python Software Foundation Fellow. They're great at using the tool.
In the common case, a performance problem in Python is not the result of hitting the limit of the language but the result of sloppy un-performant code, for example unnecessarily calling a function O(10_000) times in a hot loop.
I wrote up a more focused "Python latency numbers you should know" as a quiz here https://thundergolfer.com/computers-are-fast
I do performance optimization for a system written in Python. Most of these numbers are useless to me, because they’re completely irrelevant until they become a problem, then I measure them myself. If you are writing your code trying to save on method calls, you’re not getting any benefit from using the language and probably should pick something else.
It's always a balance.
Good designs do not happen in a vacuum but are informed by knowledge of at least the outlines of the environment.
One approach is to have breakfast while pursuing an idea -- let me spill some sticky milk on the dining table; who cares, I will clean up if it becomes a problem later.
The other is that it's not much of an overbearing constraint to avoid making a mess with spilt milk in the first place -- maybe it would not be a big bother later, but it costs me little now to not be sloppy, so let me be a little hygienic.
There's a balance between making a mess and cleaning up and not making a mess in the first place. The other extreme is to be so defensive about the possibility of creating a mess that it paralyses progress.
The sweet spot is somewhere between the extremes and having the ball-park numbers in the back of one's mind helps with that. It informs about the environment.
No.
Python’s issue is that it is incredibly slow in use cases that surprise average developers. It is incredibly slow at very basic stuff, like calling a function or accessing a dictionary.
If Python didn’t have such an enormous number of popular C and C++ based libraries it would not be here. It was saved by NumPy and the like.
I'm not sure how Python can be described as "saved" by numpy et al., when the numerical Python ecosystem was there near the beginning, and the language and ecosystem have co-evolved? Why didn't Perl (with PDL), R or Ruby (or even php) succeed in the same way?
22ns for a function call and dictionary key lookup, that's actually surprisingly fast.
i hate python but if your bottleneck is that sqlite query, optimizing a handful of addition operations is a wash. That's why you need to at least have a feel for these tables.
Agreed, and on top of that:
I think these kind of numbers are everywhere and not just specific to Python.
In Zig, I sometimes take a brief look at the CPU cycle counts of various operations to reduce cache misses, while I need to be aware of the alignment and size of a data type to debloat a data structure. If their logic applied, too bad, I should quit programming, since all languages have their own latency for certain operations we should be aware of.
There are reasons to not use Python, but that particular reason is not the one.
Our build system is written in Python, and I'd like it not to suck but still stay in Python, so these numbers very much matter.
For some of these, there are alternative modules you can use, so it is important to know this. But if it really matters, I would think you'd know this already?
For me, it will help with selecting what language is best for a task. I think it won't change my view that python is an excellent language to prototype in though.
> ... a function O(10_000) times in a hot loop
O(10_000) is a really weird notation.
Generously we could say they probably mean ~10_000 rather than O(10_000)
I think both points are fair. Python is slow - you should avoid it if speed is critical, but sometimes you can’t easily avoid it.
I think the list itself is super long winded and not very informative. A lot of operations take about the same amount of time. Does it matter that adding two ints is very slightly slower than adding two floats? (If you even believe this is true, which I don’t.) No. A better summary would say “all of these things take about the same amount of time: simple math, function calls, etc. these things are much slower: IO.” And in that form the summary is pretty obvious.
I think the list itself is super long winded and not very informative.
I agree. I have to compliment the author for the effort put in. However, it misses the point of the original “Latency Numbers Every Programmer Should Know”, which is to build an intuition for making good ballpark estimations of the latency of operations, e.g. that A is two orders of magnitude more expensive than B.
> A lot of important and large codebases were grown and maintained in Python
How does this happen? Is it just inertia that causes people to write large systems in an essentially type-free, interpreted scripting language?
Small startups end up writing code in whatever gets things working faster, because having too large a codebase with too much load is a champagne problem.
If I told you that we were going to be running a very large payments system, with customers from startups to Amazon, you'd not write it in Ruby, put the data in MongoDB, and then use its oplog as a queue... but that's what Stripe looked like. They even hired a compiler team to add type checking to the language, as that made far more sense than porting a giant monorepo to something else.
It's very simple. Large systems start as small systems.
It’s a nice and productive language. Why is that incomprehensible?
Python has types, and now even gradual static typing if you want to go further. It's irrelevant whether the language is an interpreted scripting language if it solves your problem.
It’s very natural. Python is fantastic for going from 0 to 1 because it’s easy and forgiving. So lots of projects start with it. Especially anything ML focused. And it’s much harder to change tools once a project is underway.
Most large things begin life as small things.
Someone says "let's write a prototype in Python" and someone else says "are you sure we shouldn't use a a better language that is just as productive but isn't going to lock us into abysmal performance down the line?" but everyone else says "nah we don't need to worry about performance yet, and anyway it's just a prototype - we'll write a proper version when we need to"...
10 years later "ok it's too slow; our options are a) spend $10m more on servers, b) spend $5m writing a faster Python runtime before giving up later because nobody uses it, c) spend 2 years rewriting it and probably failing, during which time we can make no new features. a) it is then."
Counterintuitively: program in python only if you can get away without knowing these numbers.
When this starts to matter, python stops being the right tool for the job.
Or keep your Python scaffolding, but push the performance-critical bits down into a C or Rust extension, like numpy, pandas, PyTorch and the rest all do.
But I agree with the spirit of what you wrote - these numbers are interesting but aren’t worth memorizing. Instead, instrument your code in production to see where it’s slow in the real world with real user data (premature optimization is the root of all evil etc), profile your code (with pyspy, it’s the best tool for this if you’re looking for cpu-hogging code), and if you find yourself worrying about how long it takes to add something to a list in Python you really shouldn’t be doing that operation in Python at all.
"if you're not measuring, you're not optimizing"
I agree. I've been living off Python for 20 years and have never needed to know any of these numbers, nor do I need them now, for my work, contrary to the title. I also regularly use profiling for performance optimization and opt for Cython, SWIG, JIT libraries, or other tools as needed. None of these numbers would ever factor into my decision-making.
.....
You don't see any value in knowing those numbers?
Exactly. If you're working on an application where these numbers matter, Python is far too high-level a language to actually be able to optimize them.
Why? I've built some massive analytic data flows in Python with turbodbc + pandas which are basically C++ fast. It uses more memory, which supports your point, but on the flip side we're talking $5-10 extra cost a year. It could frankly be $20k a year and still be cheaper than staffing more people like me to maintain these things, rather than having a couple of us and then letting the BI people use the tools we provide for them. Similarly, when we do embedded work, MicroPython is just so much easier to deal with for our engineering staff.
The interoperability between C and Python makes it great, and you need to know these numbers on Python to know when to actually build something in C. With Zig getting really great interoperability, things are looking better than ever.
Not that you're wrong as such. I wouldn't use Python to run an airplane, but I really don't see why you wouldn't care about the resources just because you're working with an interpreted or GC language.
> you need to know these numbers on Python to know when to actually build something in C
People usually approach this the other way, use something like pandas or numpy from the beginning if it solves your problem. Do not write matrix multiplications or joins in python at all.
If there is no library that solves your problem, it's a great indication that you should avoid python. Unless you are willing to spend 5 man-years writing a C or C++ library with good python interop.
From the complete opposite side, I've built some tiny bits of near irrelevant code where python has been unacceptable, e.g. in shell startup / in bash's PROMPT_COMMAND, etc. It ends up having a very painfully obvious startup time, even if the code is nearing the equivalent of Hello World
Not at all.
Some of those numbers are very important:
- Set membership check is 19.0 ns, list is 3.85 μs. Knowing what data structure to use for the job is paramount.
- Write 1KB file is 35.1 μs but 1MB file is only 207 μs. Knowing the implications of I/O trade off is essential.
- sum() over 1,000 integers is only 1,900 ns: knowing to leverage the stdlib makes all the difference compared to a manual loop.
Etc.
A few years ago I did a Python rewrite of a big client's code base. They had a massive calculation process that took 6 servers 2 hours.
We got it down to 1 server, 10 minutes, and it was not even the goal of the mission, just the side effect of using Python correctly.
In the end, quadratic behavior is quadratic behavior.
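Coming back to the sum() point above, a small sketch to make it concrete (absolute numbers vary by machine; the builtin runs the loop in C while the manual version pays interpreter overhead on every iteration):

import timeit

data = list(range(1_000))

def manual_sum(values):
    total = 0
    for v in values:
        total += v
    return total

print("sum()      :", timeit.timeit(lambda: sum(data), number=10_000))
print("manual loop:", timeit.timeit(lambda: manual_sum(data), number=10_000))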
These basically seem like numbers of last resort. After you’ve profiled and ruled out all of the usual culprits (big disk reads, network latency, polynomial or exponential time algorithms, wasteful overbuilt data structures, etc) and need to optimize at the level of individual operations.
I doubt there is much to gain from knowing how much memory an empty string takes. The article, or the listed numbers, have a weird fixation on memory usage numbers and concrete time measurements. What is way more important to "every programmer" is time and space complexity, in order to avoid designing unnecessarily slow or memory-hungry programs. Under the assumption of using Python, what is the use of knowing that your int takes 28 bytes? In the end you will have to determine whether the program you wrote meets the performance criteria you have, and if it does not, then you need a smarter algorithm or way of dealing with data. It helps very little to know that your 2d array of 1000x1000 bools is so and so big. What helps is knowing whether it is too much, and that maybe you should switch to using a large integer and a bitboard approach. Or switch language.
I disagree. Performance is a leaky abstraction that *ALWAYS* matters.
Your cognition of it is either implicit or explicit.
Even if you didn't know, for example, that list appends are (amortized) linear rather than quadratic, and fairly fast.
Even if you didn't give a shit that simple programs were for some reason 10000x slower than they needed to be, because it meets some baseline level of good enough and/or you aren't the one impacted by the problems the inefficiency creates.
Library authors beneath you would still know, and the APIs you interact with, the Pythonic code you see, and the code LLMs generate will all be affected by that leaky abstraction.
If you think that O(n^2) naive list appends are a bad example, they're not, btw: naive Python string appends are O(n^2), and that has affected and still affects how people do things; f-strings, for example, are lazy.
Similarly, a direct consequence of dictionaries being fast in Python is that they are used literally everywhere. Raymond Hettinger's old PyCon 2017 talks cover this.
Ultimately, what the author of the blog has provided is a numerical justification for the implicit, tacit knowledge that an understanding of performance gives you.
> Under the assumption of using Python, what is the use of knowing that your int takes 28 bytes?
Relevant if your problem demands instantiation of a large number of objects. This reminds me of a post where Eric Raymond discusses the problems he faced while trying to use Reposurgeon to migrate GCC. See http://esr.ibiblio.org/?p=8161
A meta-note on the title since it looks like it’s confusing a lot of commenters: The title is a play on Jeff Dean’s famous “Latency Numbers Every Programmer Should Know” from 2012. It isn’t meant to be interpreted literally. There’s a common theme in CS papers and writing to write titles that play upon themes from past papers. Another common example is the “_____ considered harmful” titles.
Going to write a real banger of a paper called "latency numbers considered harmful is all you need" and watch my academic cred go through the roof.
" ... with an Application to the Entscheidungsproblem"
Good callout on the paper reference, but this author gives every indication that he’s dead serious in the first paragraph. I don’t think commenters are confused.
This title only works if the numbers are actually useful. Those are not, and there are far too many numbers for this to make sense.
The title wasn't meant to be taken literally, as in you're supposed to memorize all of these numbers. It was meant as an in-joke reference to the original writing, to signal that this document was going to contain timing values for different operations.
I completely understand why it's frustrating or confusing by itself, though.
That doc predates 2012 significantly.
From what I've been able to glean, it was basically created in the first few years Jeff worked at Google, on indexing and serving for the original search engine. For example, the comparison of cache, RAM, and disk determined whether data was stored in RAM (the index, used for retrieval) or on disk (the documents, typically not used in retrieval, but used in scoring). Similarly, the California-to-Netherlands comparison: I believe Google's first international data center was in NL, and they needed to make decisions about copying over the entire index in bulk versus serving backend queries in the US with frontends in NL.
The numbers were always going out of date; for example, the arrival of flash drives changed disk latency significantly. I remember Jeff came to me one day and said he'd invented a compression algorithm for genomic data "so it can be served from flash" (he thought it would be wasteful to use precious flash space on uncompressed genomic data).
Every Python programmer should be thinking about far more important things than low level performance minutiae. Great reference but practically irrelevant except in rare cases where optimization is warranted. If your workload grows to the point where this stuff actually matters, great! Until then it’s a distraction.
Having general knowledge about the tools you're working with is not a distraction, it's an intellectual enrichment in any case, and can be a valuable asset in specific cases.
Knowing that an empty string is 41 bytes or how many ns it takes to do arithmetic operations is not general knowledge.
Yeah, if you hit limits just look for a module that implements the thing in C (or write it). This is how it was always done in Python.
I am currently (as we type actually LOL) doing this exact thing in a hobby GIS project: Python got me a prototype and proof of concept, but now that I am scaling the data processing to worldwide, it is obviously too slow so I'm rewriting it (with LLM assistance) in C. The huge benefit of Python is that I have a known working (but slow) "reference implementation" to test against. So I know the C version works when it produces identical output. If I had a known-good Python version of past C, C++, Rust, etc. projects I worked on, it would have been most beneficial when it came time to test and verify.
Sometimes it’s as simple as finding the hotspot with a profiler and making a simple change to an algorithm or data structure, just like you would do in any language. The amount of handwringing people do about building systems with Python is silly.
I agree - however, that has mostly been a feeling for me for years. Things feel fast enough and fine.
This page is a nice reminder of the fact, with numbers. For a while, at least, I will Know, instead of just feel, like I can ignore the low level performance minutiae.
That's misleading. There are three types of strings in Python (1, 2 and 4 bytes per character).
https://rushter.com/blog/python-strings-and-memory/
The titles are oddly worded. For example -
It seems to suggest an iteration for x in mylist is 200x slower than for x in myset. It’s the membership test that is much slower. Not the iteration. (Also for x in mydict is an iteration over keys not values, and so isn’t what we think of as an iteration on a dict’s ‘data’).
Also the overall title “Python Numbers Every Programmer Should Know” starts with 20 numbers that are merely interesting.
That all said, the formatting is nice and engaging.
I liked reading through it from a "is modern Python doing anything obviously wrong?" perspective, but strongly disagree anyone should "know" these numbers. There's like 5-10 primitives in there that everyone should know rough timings for; the rest should be derived with big-O algorithm and data structure knowledge.
It’s missing the time taken to instantiate a class.
I remember refactoring some code to improve readability, then observing something that previously took a few microseconds taking tens of seconds.
The original code created a large list of lists. Each child list had 4 fields; each field was a different thing: some were ints and one was a string.
I created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting. Modern Python developers would use a data class for this.
The new code was very slow. I’d love it if the author measured the time taken to instantiate a class.
Instantiating classes is in general not a performance issue in Python. Your issue here strongly sounds like you're abusing OO to pass a list of instances into every method and downstream call (not just the usual reference to self, the instance at hand). Don't do that, it shouldn't be necessary. It sounds like you're trying to get a poor-man's imitation of classmethods, without identifying and refactoring whatever it is that methods might need to access from other instances.
Please post your code snippet on StackOverflow ([python] tag) or CodeReview.SE so people can help you fix it.
> created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting.
I went to the doctor and I said “It hurts when I do this”
The doctor said, “don’t do that”.
Edit: so yeah, a rather snarky reply. Sorry. But it’s worth asking why we want to use classes and objects everywhere. Alan Kay is well known for saying object orientation is about message passing (a point mostly repeated by Erlang people).
A list of lists (where each list is four different types repeated) seems a fine data structure, which can be operated on by external functions, and serialised pretty easily. Turning it into classes and objects might not be a useful refactoring, I would certainly want to learn more before giving the go ahead.
The main reason why is to keep a handle on complexity.
When you’re in a project with a few million lines of code and 10 years of history it can get confusing.
Your data will have been handled by many different functions before it gets to you. If you do this with raw lists, the code gets very confusing. In one data structure the customer name might be at [4] and in another it might be at [9]. Worse, someone adds a new field at [5]; then, when two lists get concatenated, the name moves to [10] in downstream code that consumes the concatenated lists.
I mean it sounds reasonable to me to wrap the data into objects.
customers[3][4]
is a lot less readable than
customers[3].balance
> small int (0-256) cached
It's -5 to 256, and these have very tricky behavior for programmers that confuse identity and equality.
Java does similar. Confusing for beginners who run into it for the first time for sure.
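A quick way to see the footgun (CPython implementation detail, not a language guarantee; the multiplications are there to defeat constant folding):

    n = 1            # build the ints below at runtime
    a, b = 200 * n, 200 * n
    print(a is b)    # True on CPython: 200 is in the small-int cache (-5..256)

    c, d = 257 * n, 257 * n
    print(c == d)    # True: equal values
    print(c is d)    # False on CPython: two distinct objects outside the cache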
That's a long list of numbers that seem oddly specific. Apart from learning that f-strings are way faster than the alternatives, and certain other comparisons, I'm not sure what I would use this for day-to-day.
After skimming over all of them, it seems like most "simple" operations take on the order of 20ns. I will leave with that rule of thumb in mind.
If you're interested, f-strings are faster because they compile to dedicated formatting bytecode at compile time rather than going through a method call like str.format() at runtime
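If you want to see it yourself, something like this works (just a sketch; the opcode names vary by Python version):

    import dis

    def with_fstring(name, n):
        return f"{name}: {n}"

    def with_format(name, n):
        return "{}: {}".format(name, n)

    dis.dis(with_fstring)  # dedicated formatting opcodes plus BUILD_STRING
    dis.dis(with_format)   # an attribute lookup of 'format' followed by a call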
Thanks for that bit of info! I was surprised by the speed difference; I had always assumed that most variations of basic string formatting would compile to the same bytecode.
I usually prefer classic %-formatting for readability when the arguments are longer and f-strings when the arguments are shorter. Knowing there is a material performance difference at scale might shift the balance in favour of f-strings in some situations.
That number isn't very useful either, it really depends on the hardware. Most virtualized server CPUs where e.g. Django will run on in the end are nowhere near the author's M4 Pro.
Last time I benchmarked a VPS it was about the performance of an Ivy Bridge generation laptop.
> Last time I benchmarked a VPS it was about the performance of an Ivy Bridge generation laptop.
I have a number of Intel N95 systems around the house for various things. I've found them to be a pretty accurate analog for small VPS instances. The N95's cores are Intel E-cores, which are effectively Sandy Bridge/Ivy Bridge-class cores.
Stuff can fly on my MacBook but then drag on a small VPS instance, so validating against an N95 (which I already have) is helpful. YMMV.
Python programmers don't need to know 85 different obscure performance numbers. Better to really understand ~7 general system performance numbers.
Nice numbers and it's always worth to know an order of magnitude. But these charts are far away from what "every programmer should know".
I think we can safely steelman the claim to "every Python programmer should know", and even from there, every "serious" Python programmer, writing Python professionally for some "important" reason, not just everyone who picks up Python for some scripting task. Obviously there's not much reason for a C# programmer to go try to memorize all these numbers.
Though IMHO it suffices just to know that "Python is 40-50x slower than C and is bad at using multiple CPUs" is not just some sort of anti-Python propaganda from haters, but a fairly reasonable engineering estimate. If you know that you don't really need that chart. If your task can tolerate that sort of performance, you're fine; if not, figure out early how you are going to solve that problem, be it through the several ways of binding faster code to Python, using PyPy, or by not using Python in the first place, whatever is appropriate for your use case.
This is a really weird thing to worry about in Python. But it is also misleading: Python ints are arbitrary precision, so they can take up much more storage and arithmetic time depending on their value.
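Easy to see with sys.getsizeof (CPython-specific sizes, shown here only to illustrate the growth):

    import sys

    print(sys.getsizeof(1))           # a small int
    print(sys.getsizeof(10**18))      # a 64-bit-sized value
    print(sys.getsizeof(10**100))     # noticeably larger
    print(sys.getsizeof(2**100_000))  # kilobytes of storage for a single number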
You absolutely do not need to know those absolute numbers--only the relative costs of various operations.
Additionally, regardless of the code you can profile the system to determine where the "hot spots" are and refactor or call-out to more performant (Rust, Go, C) run-times for those workflows where necessary.
I'm surprised that the `isinstance()` comparison is with `type() == type` and not `type() is type`, which I would expect to be faster, since the `==` implementation tends to have an `isinstance` call anyway.
Also seems like the repo is now private, so I can't open an issue, or reproduce the numbers.
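This is roughly the comparison I'd want to see (my own quick micro-benchmark, not the article's; timings vary by machine):

    import timeit

    setup = "x = 42"
    print("isinstance(x, int):", timeit.timeit("isinstance(x, int)", setup=setup))
    print("type(x) == int:    ", timeit.timeit("type(x) == int", setup=setup))
    print("type(x) is int:    ", timeit.timeit("type(x) is int", setup=setup))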
The ones I noticed the most were `import openai` and `import numpy`.
They're both about a full second on my old laptop.
I ended up writing my own simple LLM library just so I wouldn't have to import OpenAI anymore for my interactive scripts.
(It's just some wrapper functions around the equivalent of a curl request, which is honestly basically everything I used the OpenAI library for anyway.)
I have noticed how long it takes to import numpy. It made rerunning a script noticeably sluggish. Not sure what openai's excuse is, but I assume numpy's slowness comes from loading native DLLs?
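If you want to dig in, CPython can break the cost down per module with `python -X importtime -c "import numpy"`, or you can do a crude in-process measurement (assumes numpy is installed):

    import time

    start = time.perf_counter()
    import numpy  # deliberately not at the top: we're timing the import itself
    print(f"import numpy: {time.perf_counter() - start:.2f}s")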
Interesting information but these are not hard numbers.
Surely the figure of 141 bytes for a 100-char string is not generally correct, as it would only apply to ASCII 100-char strings.
It would be more useful to also know the overhead for non-ASCII Unicode strings. I would presume a 100-emoji string would take 441 bytes (just a hypothesis) and a string of 100 umlaut characters would take 241 bytes.
There is a lot of discussion here about the relevance of these numbers for a regular software engineer.
Firstly, I want to start with the fact that the base system is a macOS/M4 Pro, hence:
- Memory-related access is possibly much faster than on an x86 server.
- Disk access is possibly much slower than on an x86 server.
*) I took an x86 server as the baseline since most applications run on x86 Linux boxes nowadays, although a good share of deployments also run on other ARM CPUs.
Although it probably does not change the memory footprint much, the libraries loaded and their architecture (i.e. running under Rosetta or not) will change the overall footprint of the process.
As mentioned in one of the sibling comments: always inspect/trace your own workload's performance before making assumptions. It all depends on the specific use case for higher-level performance optimizations.
I doubt list and string concatenation operate in constant time; otherwise they would affect another benchmark. E.g., you could concatenate two lists in the same time regardless of their size, but at the cost of slower access to the second one (or both).
More contentiously: don't fret too much over performance in Python. It's a slow language (except for some external libraries, but that's not the point of the OP).
String concatenation is mentioned twice on that page, with the same time given. The first mention has the parenthetical "(small)"; the second doesn't. I expect you were looking at the second one when you typed that, and I would agree you can't just label it as constant time. But they do seem to have meant concatenating "small" strings, where the overhead of Python's object construction dominates the cost of building the combined string.
Great catalogue. On the topic of msgspec, since pydantic is included it may be worth including a bench for de-serializing and serializing from a msgspec struct.
What would be the explanation for an int taking 28 bytes but a list of 1000 ints taking only 7.87KB?
That appears to be the size of the list itself, not including the objects it contains: 8 bytes per entry for the object pointer, and a kilo-to-kibi conversion. All Python values are "boxed", which is probably a more important thing for a Python programmer to know than most of these numbers.
The list of floats is larger, despite also being simply an array of 1000 8-byte pointers. I assume that it's because the int array is constructed from a range(), which has a __len__(), and therefore the list is allocated to exactly the required size; but the float array is constructed from a generator expression and is presumably dynamically grown as the generator runs and has a bit of free space at the end.
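A sketch of that hypothesis (CPython-specific; exact sizes vary by version, and getsizeof here counts only the container, not the elements):

    import sys

    presized = list(range(1000))                   # length known up front via the length hint
    grown = list(float(i) for i in range(1000))    # built element by element, with over-allocation

    print(sys.getsizeof(presized))
    print(sys.getsizeof(grown))   # typically somewhat larger, matching the hypothesis above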
It's impressive how you figured out the reason for the difference in container size between the list of floats and the list of ints. Framed as an interview question, that would have been quite difficult, I think.
It was. I updated the results to include the contained elements. I also updated the float list creation to match the int list creation.
It's important to know that these numbers will vary based on what you're measuring, your hardware architecture, and how your particular Python binary was built.
For example, my M4 Max running Python 3.14.2 from Homebrew (built, not poured) takes 19.73MB of RAM to launch the REPL (running `python3` at a prompt).
The same Python version launched on the same system with a single invocation for `time.sleep()`[1] takes 11.70MB.
My Intel Mac running Python 3.14.2 from Homebrew (poured) takes 37.22MB of RAM to launch the REPL and 9.48MB for `time.sleep`.
My number for "how much memory it's using" comes from running `ps auxw | grep python`, taking the value of the resident set size (RSS column), and dividing by 1,024.
1: python3 -c 'from time import sleep; sleep(100)'
Author here.
Thanks for the feedback, everyone. I appreciate @woodenchair posting it and @aurornis pointing out the intent of the article.
The idea of the article is NOT to suggest you should shave 0.5ns off by choosing some dramatically different algorithm or that you really need to optimize the heck out of everything.
In fact, I think a lot of what the numbers show is that overthinking the optimizations often isn't worth it (e.g. caching len(coll) in a variable rather than calling it over and over is less useful than it might seem conceptually).
Just write clean Python code. So much of it is way faster than you might have thought.
My goal was only to create a reference to what various operations cost to have a mental model.
Then you should have written that. Instead you have given more fodder for the premature optimization crowd.
I didn't tell anyone to optimize anything. I just posted numbers. It's not my fault some people are wired that way. Anytime I suggested some sort of recommendation it was to NOT optimize.
For example, from the post "Maybe we don’t have to optimize it out of the test condition on a while loop looping 100 times after all."
I'm confused by this:
It says f-strings are fastest but the numbers show concatenation taking less time? I thought it might be a typo but the bars on the graph reflect this too?
String concatenation isn't usually considered a "formatting style", that refers to the other three rows of the table which use a template string and have specialized syntax inside it to format the values.
Perhaps it's because in all but the simplest cases, you need 2 or more concatenations to achieve the same result as a single f-string? E.g. something like f"{name}: {n}" vs name + ": " + str(n).
The only case where concatenation would be faster is something like: "foo" + str(expression)
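For completeness, a rough sketch of the comparison (not the article's harness; numbers vary by machine and Python version):

    import timeit

    setup = "name = 'world'; n = 42"
    print("f-string:", timeit.timeit('f"{name}: {n}"', setup=setup))
    print("concat:  ", timeit.timeit("name + ': ' + str(n)", setup=setup))
    print("format:  ", timeit.timeit("'{}: {}'.format(name, n)", setup=setup))
    print("percent: ", timeit.timeit("'%s: %s' % (name, n)", setup=setup))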
This is helpful. Someone should create a similar benchmark for the BEAM. This is also a good reminder to continue working on snakepit [1] and snakebridge [2]. Plenty remains before they're suitable for prime time.
[1] https://hex.pm/packages/snakepit [2] https://hex.pm/packages/snakebridge
As someone who most often works in a language that is literally orders of magnitude slower than this (and has done so since CPU speeds were measured in double-digit megahertz), I am crying at the notion that anything here is measured in nanoseconds.
Hmmmm, there should absolutely be standard deviations for this type of work. Also, what is N, the number of runs? Does it say somewhere?
It is open source, you could just look. :) But here is a summary for you. It's not just a single run where you take the number:
Benchmark Iteration Process
Core Approach:
- Warmup Phase: 100 iterations to prepare the operation (default)
- Timing Runs: 5 repeated runs (default), each executing the operation a specified number of times
- Result: Median time per operation across the 5 runs
Iteration Counts by Operation Speed:
- Very fast ops (arithmetic): 100,000 iterations per run
- Fast ops (dict/list access): 10,000 iterations per run
- Medium ops (list membership): 1,000 iterations per run
- Slower ops (database, file I/O): 1,000-5,000 iterations per run
Quality Controls:
- Garbage collection is disabled during timing to prevent interference
- Warmup runs prevent cold-start bias
- Median of 5 runs reduces noise from outliers
- Results are captured to prevent compiler optimization elimination
Total Executions: For a typical benchmark with 1,000 iterations and 5 repeats, each operation runs 5,100 times (100 warmup + 5×1,000 timed) before reporting the median result.
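A minimal sketch of that methodology, reconstructed from the summary above (not the project's actual harness):

    import gc
    import statistics
    import time

    def bench(op, iterations=1_000, repeats=5, warmup=100):
        for _ in range(warmup):              # warmup phase against cold-start bias
            op()
        gc_was_enabled = gc.isenabled()
        gc.disable()                         # keep the collector from interfering
        try:
            runs = []
            for _ in range(repeats):
                start = time.perf_counter()
                for _ in range(iterations):
                    op()
                runs.append((time.perf_counter() - start) / iterations)
            return statistics.median(runs)   # median of the repeated runs
        finally:
            if gc_was_enabled:
                gc.enable()

    d = {i: i for i in range(1000)}
    print(f"{bench(lambda: d[500]) * 1e9:.1f} ns per operation")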
That answers what N is (why not just say so in the article?). If you are only going to report medians, is there an appendix with further statistics such as confidence intervals or standard deviations? For a serious benchmark, it would be essential to show the spread or variability, no?
Surprised that list comprehensions are only 26% faster than for loops. It used to feel more like 4-5x.
The point of the original list was that the numbers were simple enough to memorize: https://gist.github.com/jboner/2841832
Nobody is going to remember any of the numbers on this new list.
That's a fair point @esafak. I updated the article with something akin to the doubling chart of numbers in the original article from 2012.
I think a lot of commenters here are missing the point.
Looking at performance numbers is important regardless of whether it's Python, assembly, or HDL. If you don't understand why your code is slow, you can always look at how many cycles things take and learn how code works at a deeper level. As you mature as a programmer these things become obvious, but going through the learning process and having references like these will help you get there sooner. Seeing the performance numbers and asking why some things take much longer, or sometimes why they take exactly the same time, is the perfect opportunity to learn.
Early in my Python career I had a script that found duplicate files across my disks. The first iteration was extremely slow, and optimizing it went through several rounds as I learned how to optimize at various levels. None of them required me to use C: I just used caching, learned to enumerate all the files on disk quickly, and used sets instead of lists. The end result was that subsequent runs took 10 seconds instead of 15 minutes. Maybe implementing it in C would make it run in 1 second, but if I had just assumed my script was slow because of Python, I would've spent hours rewriting it in C only to go from 15 minutes to 14 minutes and 51 seconds.
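For anyone curious, the shape of the final approach was roughly this (a reconstruction for illustration, not the actual script):

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        # Group by size first: files with a unique size can't be duplicates.
        by_size = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue

        # Hash only the remaining candidates; dict/set lookups keep this cheap per file.
        by_digest = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            for path in paths:
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                by_digest[digest].append(path)

        return [group for group in by_digest.values() if len(group) > 1]

    # print(find_duplicates("/some/directory"))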
There's an argument to be made that it would be useful to see C numbers next to the Python ones, but for the same reason people don't tell you to just use an FPGA instead of C, it's also rude to say Python is the wrong tool when often it isn't.
Initially I thought about how efficient strings are... but then I understood how inefficient arithmetic is. Interesting comparison, but exact speed and I/O depend on a lot of things, and it's unlikely anyone uses a Mac mini in production, so these numbers definitely aren't representative.
Why? If those micro benchmarks mattered in your domain, you wouldn't be using python.
That's an "all or nothing" fallacy. Just because you use Python and are OK with some slowdown, doesn't mean you're OK with each and every slowdown when you can do better.
To use a trivial example, using a set instead of a list to check membership is a very basic replacement, and can dramatically improve your running time in Python. Just because you use Python doesn't mean anything goes regarding performance.
That's an example of an algorithmic improvement (O(1) vs O(n)), not a micro-benchmark, Mr. Fallacy.
...and other hilarious jokes you can tell yourself!
Great reference overall, but some of these will diverge in practice: 141 bytes for a 100-char string won't hold for non-ASCII strings, for example, and will change if/when the object header overhead changes.
One of the reasons I'm really excited about JAX is that I hope it will allow me to write fast Python code without worrying about these details.
> Attribute read (obj.x) 14 ns
note that protobuf attributes are 20-50x worse than this
I'm confused why they repeatedly call a slots class larger than a regular dict class, but don't count the size of the dict
> Numbers are surprisingly large in Python
Makes me wonder if the CPython devs have ever considered V8-like NaN-boxing or pointer stuffing.
I wonder why an empty set takes so much more memory than an empty dict
int is larger than float, but list of floats is larger than list of ints
Then again, if you're worried about any of the numbers in this article maybe you shouldn't be using Python at all. I joke, but please do at least use Numba or Numpy so you aren't paying huge overheads for making an object of every little datum.
+1 but I didn't see pack / unpack...
Exactly wrong.
LLMs can improve Python code performance. I've used them myself on a few projects.
It is always a good idea to have at least a rough understanding of how much operations in your code cost, but sometimes very expensive mistakes end up in non-obvious places.
If I have only plain Python installed and a .py file that I want to test, then what's the easiest way to get a visualization of the call tree (or something similar) and the computational cost of each item?
My god, the memory bloat is out of this world compared to platforms like the JVM or .NET, let alone C++ or Rust!
I have some questions and requests for clarification/suspicious behavior I noticed after reviewing the results and the benchmark code, specifically:
- If slotted attribute reads and regular attribute reads are the same latency, I suspect that either the regular class may not have enough "bells on" (inheritance/metaprogramming/dunder overriding/etc) to defeat simple optimizations that cache away attribute access, thus making it equivalent in speed to slotted classes. I know that over time slotting will become less of a performance boost, but--and this is just my intuition and I may well be wrong--I don't get the impression that we're there yet.
- Similarly "read from @property" seems suspiciously fast to me. Even with descriptor-protocol awareness in the class lookup cache, the overhead of calling a method seems surprisingly similar to the overhead of accessing a field. That might be explained away by the fact that property descriptors' "get" methods are guaranteed to be the simplest and easiest to optimize of all call forms (bound method, guaranteed to never be any parameters), and so the overhead of setting up the stack/frame/args may be substantially minimized...but that would only be true if the property's method body was "return 1" or something very fast. The properties tested for these benchmarks, though, are looking up other fields on the class, so I'd expect them to be a lot slower than field access, not just a little slower (https://news.ycombinator.com/item?id=46056895) and not representative. To benchmark "time it takes for the event loop to spin once and produce a result"/the python equivalent of process.nextTick, it'd be better to use low-level loop methods like "call_soon" or defer completion to a Task and await that.
TFA mentions running the benchmark on a multi-core platform, but doesn't mention whether the benchmark results used multithreading... a brief look at the code suggests not.
Yeah... no. I have 10+ years of Python under my belt and I might have needed this kind of micro-optimization maybe 2 times at most.
Sorry, you’re not allowed to discourage premature optimization or defend Python here.
This is AI slop.
Sad that your comment is downvoted. But yes, for those who need clarification:
1) The measurements are faulty. A list of 1,000 ints can be 4x smaller. Most time measurements depend on circumstances that are not mentioned and therefore can't be reproduced.
2) Brainrot AI style. A hashmap is not "200x faster than a list!"; that's not how complexity works.
3) orjson/ujson are faulty, which is one of the reasons they don't replace the stdlib implementation. Expect crashes, broken JSON, anything from them.
4) What will actually be used in number-crunching applications (numpy or similar libraries) is not even mentioned.