The design of Pandas is inferior in every way to Polars: API, memory use, speed, expressiveness. Pandas has been strictly worse since late 2023 and will never close the gap. Polars is multithreaded by default, written in a low-level language, has a powerful query engine, supports lazy, larger-than-memory execution, and isn’t constrained by any compatibility concerns with a warty, eager-only API and pre-Arrow data types that aren’t nullable.
It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.
Pandas started about 18 years ago as a project by someone working in finance who wanted to use Python instead of Excel, while being nicer than raw Python dicts and NumPy arrays.
For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place: prepare some data, then assign new values directly into a slice of the frame.
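A minimal sketch of that in-place style (the data and column names here are hypothetical, just to illustrate):

```python
import pandas as pd

# Prepare some data
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# And then overwrite part of a column in place, Excel-style
df.loc[df["a"] > 1, "b"] = 0.0
print(df["b"].tolist())  # [4.0, 0.0, 0.0]
```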
Polars comes from a more modern data engineering philosophy, in which data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column.
If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be then temporarily convert the data to Pandas or even to a Numpy array, manipulate, and then convert back.
P.S. Polars has an optimization to overwrite a single value
But as far as I know, it doesn't allow slicing or anything.
Polars took a lot of ideas from Pandas and made them better - calling it "inferior in every way" is all sorts of disrespectful :P
Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.
Luckily, polars has .to_pandas() so you can still pass pandas dataframes to the libraries that really are still stuck on that interface.
I maintain one of those libraries and everything is polars internally.
I almost fully agree. I would add that the Pandas API is poorly thought through and full of footguns.
Where I certainly disagree is the "frame as a dict of time series" setting, and general time series analysis.
The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.
I think that's a fair opinion, but I'd argue against it being poorly thought out: pandas HAS to stick with older API decisions for backwards compatibility, decisions dating back to before data science was a mature field (and the field has pandas to thank for much of that maturity).
I think that's a sane take. Indeed, I think most data analysts find it much easier to use pandas over polars when playing with data (mainly the bracket syntax is faster and mostly sensible)
Sounds too much like an advertisement. Also, we need to watch out when diving into Polars: it is a VC-backed open-source project with a cloud offering, which may become an open-core project, and we know how those go.
> we know how those go
They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.
I would agree, if not for the fact that Polars is not compatible with Python multiprocessing when using the default fork start method; the following script hangs forever (the Pandas equivalent runs):

    import polars as pl
    from concurrent.futures import ProcessPoolExecutor

    pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}).write_parquet("test.parquet")

    def read_parquet():
        x = pl.read_parquet("test.parquet")
        print(x.shape)

    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(read_parquet) for _ in range(100)]
        r = [f.result() for f in futures]
Using a thread pool or the "spawn" start method works, but it makes Polars a pain to use inside e.g. a PyTorch DataLoader.
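The "spawn" workaround mentioned here looks roughly like this (a sketch with a stand-in worker function; in real use the worker must be importable from the main module):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def work(i):
    return i * i

if __name__ == "__main__":
    # Explicitly request "spawn" instead of the default "fork" on Linux,
    # so child processes start fresh rather than inheriting thread state
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        print(list(executor.map(work, range(4))))  # [0, 1, 4, 9]
```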
I can't believe parallel processing is still this big of a dumpster fire in python 20 years after multi-core became the rule rather than the exception.
Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?
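There is no such flag; the closest stdlib idiom is mapping a function over a process pool. A minimal sketch:

```python
from concurrent.futures import ProcessPoolExecutor

def square(n):
    return n * n

if __name__ == "__main__":
    # Python's nearest thing to a "parallel for" annotation:
    # distribute iterations of a pure function across worker processes
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(5))))  # [0, 1, 4, 9, 16]
```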
why not just go full bore to duckdb?
I've migrated off of pandas to polars for my workflows to reap the benefit of, in my experience, a 10-20x speedup on average. I can't imagine anything bringing me back short of a performance miracle. LLMs have made syntax almost a non-barrier.
Went from pandas to polars to duckdb. As mentioned elsewhere SQL is the most readable for me and LLM does most of the coding on my end (quant). So I need it at the most readable and rudimentary/step-wise level.
OT, but I can’t imagine data science being a job category for too long. It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.
<< It’s got to be one of the first to go in AI age especially since the market is so saturated with mediocre talents.
This is interesting. I wanted to dig into it a little since I am not sure I am following the logic of that statement.
Do you mean that AI would take over the field, because by default most people there are already not producing anything that a simple 'talk to data' LLM won't deliver?
also migrated, but to duckdb.
It's funny to look back at the tricks that were needed to get gpt3 and 3.5 to write SQL (e.g. "you are a data analyst looking at a SQL database with table [tables]"). It's almost effortless now.
Same. I don't even use LLM normally as I found polars' syntax to be very intuitive. I just searched my ChatGPT history and the only times I used it are when I'm dealing with list and struct columns that were not in pandas.
iirc part of pandas’ popularity was that it modeled some of R’s ergonomics. What a time in history, when such things mattered! (To be clear, I’m not making fun of pandas. It was the bridge I crossed that moved me from living in Excel to living in code.)
Polars being so fast, and embeddable into other languages, has made it a no brainer for me to adopt it.
I have integrated Explorer https://github.com/elixir-explorer/explorer, which leverages it, into many Elixir apps, so happy to have this.
Do you not experience LLM generated code constantly trying to use Pandas' methods/syntax for Polars objects?
Yes, ChatGPT 5.2 Pro absolutely still does this. Just ask it for a pivot table using Polars and it will probably spit out code with Pandas arguments that doesn’t work.
There were some growing pains in gpt-3.5 to gpt-4 era, but not nowadays (shoutout to the now-defunct Phind, which was a game changer back then).
" 10-20x speedup on average. "
Is this everyone's experience?
That was probably about what I got when I migrated some heavy number crunching code from Pandas to Polars a few years ago. Maybe even better than that.
Same, and Polars also works in TypeScript, which I used at one point to move my data from backend to frontend.
The speedup you claim is going to be contingent on how you use Pandas, with which data types, and which version of Pandas.
That timestamp resolution discrepancy is going to cause so many problems
Haven't used pandas in a while, but Copy-on-Write sounds pretty cool! Is there any public benchmark I can check in 2026?
How soon will the leading LLMs ingest the updated documentation? Because I'm certainly not going to.
This is the most misunderstood aspect of how marketing has changed recently
Use context7 mcp. It'll do the trick
In my experience, it would take a year to ingest it natively, and two years to also ingest enough coding examples.
s/impactfull/impactful
Regex is great when one is communicating with machines
I have deep respect for Pandas; it and JupyterLab were my intro to programming. And it worked much better for me: I did some "intro to Python" courses, but it was all about strs and ints. And yes, you can add strs together! Wow, magic... Not for me. For me it all clicked when I first looped through a pile of Excel files (pd.read_excel()), extracted the info I needed, and wrote a new Excel file... Mind blown.
From there, of course, you slowly start to learn about types etc., and slowly you start to appreciate libraries and IDEs. But I knew tables, statistics, and graphs, and Pandas (with the visual style of notebooks) led me to programming via that familiar world. At first with some frustration about Pandas and needing to write to Excel, do stuff, and read again, but quickly moving into the opposite flow, where Excel itself became the limiting factor and having to use it became the annoyance.
I offered some "Programming for Biologists" courses, to teach people like me to do programming in this way, because it would be much less "dry" (pd.read_excel().plot.bar() and now you're programming). So far, wherever I offered the courses they said they prefer to teach programming "from the base up". Ah well! I've been told I'm not a programmer; I don't care. I solve problems.