Comment by edschofield

25 days ago

The design of Pandas is inferior in every way to Polars: API, memory use, speed, expressiveness. Pandas has been strictly worse since late 2023 and will never close the gap. Polars is multithreaded by default, written in a low-level language, has a powerful query engine, supports lazy, out-of-memory execution, and isn’t constrained by any compatibility concerns with a warty, eager-only API and pre-Arrow data types that aren’t nullable.

It’s probably not worth incurring the pain of a compatibility-breaking Pandas upgrade. Switch to Polars instead for new projects and you won’t look back.
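On the nullability point, here is a minimal sketch of what the default pandas dtypes do with a missing value, next to the opt-in nullable extension dtype that pandas also ships. The variable names are made up for illustration, and the snippet only shows behaviour, not performance.

```python
import pandas as pd

# Default NumPy-backed integers cannot hold a missing value,
# so pandas falls back to float64 and stores NaN.
default_ints = pd.Series([1, None, 3])

# The opt-in "Int64" extension dtype keeps integers and stores pd.NA.
nullable_ints = pd.Series([1, None, 3], dtype="Int64")

print(default_ints.dtype)
print(nullable_ints.dtype)
```

So nullable integers exist in pandas, but you have to ask for them rather than getting them by default.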

72 comments


data-ottawa 25 days ago

Pandas deserves a ton of respect in my opinion. I built my career on knowing it well and using it daily for a decade, so I’m biased.

Pandas created the modern Python data stack when there were not really any alternatives (except R and closed source). The original split-apply-combine paradigm was well thought out, simple, and effective, and the built-in tools to read pretty much anything (including all of your awful CSV files and Excel tables) and deal with timestamps easily made it fit into tons of workflows. It pioneered a lot, and basically still serves as the foundation and common format for the industry.
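For readers who have not met the term, a minimal sketch of split-apply-combine in pandas follows. The column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["a", "a", "b", "b", "b"], "score": [1, 2, 3, 4, 5]}
)

# Split the rows by team, apply a mean to each group,
# and combine the per-group results into one Series.
means = df.groupby("team")["score"].mean()

print(means)
```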

I always recommend every member of my teams read Modern Pandas by Tom Augspurger when they start, as it covers all the modern concepts you need to get data work done fast and with high quality. The concepts carry over to polars.

And I have to thank the pandas team for being a very open and collaborative bunch. They’re humble and smart people, and every PR or issue I’ve interacted with them on has been great.

Polars is undeniably great software; it’s my standard tool today. But it did benefit from the failures and hard edges of pandas, pyspark, dask, the tidyverse, and xarray. That’s an advantage pandas didn’t have, and pandas still pays for it.

I’m not trying to take away from polars at all. It’s damn fast — the benchmarks are hard to beat. I’ve been working on my own library and basically every optimization I can think of is already implemented in polars.

I do have a concern with their VC funding/commercialization with cloud. The core library is MIT licensed, but knowing they’ll always have this feature wall when you want to scale is not ideal. I think it limits the future of the library a lot, and I think long term someone will fill that niche and the users will leave.

neves 25 days ago
Is this the Modern Pandas reference you recommend?
https://tomaugspurger.net/posts/modern-1-intro/
- data-ottawa 25 days ago
  
  Yes it is
nothrowaways 25 days ago

Very well articulated.

sampo 25 days ago

Historically 18 years ago, Pandas started as a project by someone working in finance to use Python instead of Excel, yet be nicer than using just raw Python dicts and Numpy arrays.

For better or worse, like Excel and like the simpler programming languages of old, Pandas lets you overwrite data in place.

Prepare some data

    df_pandas = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
    df_polars = pl.from_pandas(df_pandas)

And then

    df_pandas.loc[1:3, 'b'] += 1

    df_pandas
       a   b
    0  1  10
    1  2  21
    2  3  31
    3  4  41
    4  5  50

Polars comes from a more modern data engineering philosophy, in which data is immutable. In Polars, if you ever wanted to do such a thing, you'd write a pipeline to process and replace the whole column.

    df_polars = df_polars.with_columns(
        pl.when(pl.int_range(0, pl.len()).is_between(1, 3))
        .then(pl.col("b") + 1)
        .otherwise(pl.col("b"))
        .alias("b")
    )

If you are just interactively playing around with your data, and want to do it in Python and not in Excel or R, Pandas might still hit the spot. Or use Polars, and if need be then temporarily convert the data to Pandas or even to a Numpy array, manipulate, and then convert back.

P.S. Polars has an optimization to overwrite a single value

    df_polars[4, 'b'] += 5
    df_polars
    ┌─────┬─────┐
    │ a   ┆ b   │
    │ --- ┆ --- │
    │ i64 ┆ i64 │
    ╞═════╪═════╡
    │ 1   ┆ 10  │
    │ 2   ┆ 21  │
    │ 3   ┆ 31  │
    │ 4   ┆ 41  │
    │ 5   ┆ 55  │
    └─────┴─────┘

But as far as I know, it doesn't allow slicing or anything.

richardbachman 24 days ago

`row_index()` was also recently added.

  df.with_columns(pl.col.b + pl.row_index().is_between(1, 3))
  # shape: (5, 2)
  # ┌─────┬─────┐
  # │ a   ┆ b   │
  # │ --- ┆ --- │
  # │ i64 ┆ i64 │
  # ╞═════╪═════╡
  # │ 1   ┆ 10  │
  # │ 2   ┆ 21  │
  # │ 3   ┆ 31  │
  # │ 4   ┆ 41  │
  # │ 5   ┆ 50  │
  # └─────┴─────┘

> Polars has an optimization to overwrite a single value

I believe it is just "syntax sugar" for calling `Series.scatter()`[1]

> it doesn't allow slicing

I believe you are correct:

  df_polars[1:3, "b"] += 1
  # TypeError: cannot use "slice(1, 3, None)" for indexing

You can do:

  df_polars[list(range(1, 4)), "b"] += 1

Perhaps nobody has requested slice syntax? It seems like it would be easy to add.

[1]: https://github.com/pola-rs/polars/blob/9079e20ae59f8c75dcce8...
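One detail behind the `range(1, 4)` above: the pandas `.loc[1:3]` in the earlier example slices by label and includes both endpoints, while positional slicing follows Python's half-open convention. A small pandas-only sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# .loc slices by label and includes both endpoints: rows 1, 2 and 3.
by_label = df.loc[1:3, "b"].tolist()

# .iloc slices by position like a Python list: rows 1 and 2 only.
by_position = df.iloc[1:3, df.columns.get_loc("b")].tolist()

print(by_label)
print(by_position)
```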

goatlover 25 days ago
The Polars code puts me off as being too verbose and requiring too many steps. I love the broadcasting ability that Pandas gets from Numpy. It's what scientific computing should look like in my opinion. Maybe R, Julia or some array-based language does it a bit better than Numpy/Pandas, but it's certainly not like the Polars example.
- thijsn 25 days ago
  
  Polars is indeed more verbose when coming from pandas, but in my experience it is an advantage for when you're reading that same code after not having touched it for months.
  pandas is write-optimized, so you can quickly and powerfully transform your data. Once you're used to it, it allows you to quickly get your work done. But figuring out what is happening in that code after returning to it a while later is a lot harder compared to Polars, which is more read-optimized. This read-optimized API coincidentally allows the engine to perform more optimizations because all implicit knowledge about data must be typed out instead of kept in your head.
  
- thereisnospork 25 days ago
  
  Likewise, I was considering trying Polars until I saw that example. The pandas example is a good approximation of how I think and want to transform/process data even if it is ugly under the hood. I do occasionally find numpy and pandas annoying wrt when they return a view vs a copy but the cure seems worse than the disease.

satvikpendem 25 days ago

"If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton

Polars is great, but it is better precisely because it learned from all the mistakes of Pandas. Don't besmirch the latter just because it now has to deal with the backwards compatibility of those mistakes, because when it first started, it was revolutionary.

crystal_revenge 25 days ago

Can one criticize pandas by comparing to R's native DataFrames that have existed since R's inception in the 90s?
I (and many others) hated Pandas long before Polars was a thing. The main problem is that it's a DSL that doesn't really work well with the rest of Python (that and multi-index is awful outside of the original financial setting). If you're doing pure data science work it doesn't really come up, but as soon as you need to transform that work into a production solution it starts to feel quite gross.
Before Polars my solution was (and still largely remains) to do most of the relational data transformations in the data layer, and then use dicts, lists and numpy for all the additional downstream transformations. This made it much easier to break out of the "DS bubble" and incorporate solutions into main products.
vegabook 25 days ago
"revolutionary"? It just copied and pasted the decades-old R (previously "S") dataframe into Python, including all the paradigms (with worse ergonomics since it's not baked into the language).
- data-ottawa 25 days ago
  
  No other modern language will compete with R on ergonomics because of how it allows functions to read the context they’re called in, and S expressions are incredibly flexible. The R manual is great.
  To say pandas just copied it but worse is overly dismissive. The core of pandas has always been indexing/reindexing, split-apply-combine, and slicing views.
  It’s a different approach than R’s data tables or frames.
  
- sampo 25 days ago
  
  This is an interesting question.
  Dataframes first appeared in S-PLUS in 1991-1992. Then R copied S, and from 1995-1996-1997 onwards R started to grow in popularity in statistics. As free and open source software, R started to take over the market among statisticians and other people who were using other statistical software, mainly SAS, SPSS and Stata.
  Given that S and R existed, why were they mostly not picked up by data analysts and programmers in 1995-2008, and only Python and Pandas made dataframes popular from 2008 onwards?
- xtracto 25 days ago
  
  Exactly. I was programming in R in 2004 and Pandas didn't exist. I remember trying Pandas once and it felt unergonomic for data analysis and it lacked the vast collection of statistical analysis libraries.
- BeetleB 25 days ago
  
  It was revolutionary to Python. Without NumPy and Pandas, ML in Python would never have been a thing.
  (Yes, yes - I know some people wish that were the case!)
Xunjin 25 days ago

Indeed, even Rust was created by learning from the mistakes of memory management and from known patterns like the famous RAII.
bicepjai 25 days ago

With all great observations made, the quote still stands: "If I have seen further, it is by standing on the shoulders of giants" - Isaac Newton. When people say they feel a sense of community, this is exactly what it means in software philosophy: we do something, others learn from it, and make better ones. In no way is the inspiration’s origin below what it inspired.

v3ss0n 25 days ago

Sounds too much like an advertisement. Also we need to watch out when diving into Polars. Polars is a VC-backed open source project with a cloud offering, which may become an open-core project - we know how those go.

gkbrk 25 days ago
> we know how those go
They get forked and stay open source? At least this is what happens to all the popular ones. You can't really un-open-source a project if users want to keep it open-source.
- stingraycharles 25 days ago
  
  Depends on your definition of popular; plenty of examples where the business interests don't align well with open source.
- v3ss0n 24 days ago
  
  Not many can maintain a complex project full time.
quentindanjou 25 days ago
I was also thinking that this comment looks like an ad. Pandas does not have any paid option and isn't made directly for profit.
- disgruntledphd2 25 days ago
  
  To be fair, as someone who's fought pandas for many years I agree with basically everything they said. The API design for Polars is much, much more intuitive. It's a base R to dplyr level change.

rdedev 25 days ago

While polars is better if you work with predefined data formats, pandas is imo still better as a general purpose table container.

I work with chemical datasets and this always involves converting SMILES strings to RDKit Molecule objects. Polars cannot do this as simply as calling .map on pandas.

Pandas is also much better to do EDA. So calling it worse in every instance is not true. If you are doing pure data manipulation then go ahead with polars
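A minimal sketch of the `.map` pattern described above. To keep it self-contained, `Molecule` and `parse` are hypothetical stand-ins rather than RDKit calls; real code would call into RDKit at the marked spot.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Molecule:
    """Hypothetical stand-in for an RDKit molecule object."""
    smiles: str
    n_chars: int


def parse(smiles: str) -> Molecule:
    # Placeholder parser; real code would call into RDKit here.
    return Molecule(smiles=smiles, n_chars=len(smiles))


df = pd.DataFrame({"smiles": ["CCO", "c1ccccc1", "CC(=O)O"]})

# .map calls parse once per value and stores whatever it returns,
# so the new column simply holds arbitrary Python objects.
df["mol"] = df["smiles"].map(parse)

print(df["mol"].iloc[1])
```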

data-ottawa 25 days ago
Map is one operation pandas does nicely that most other “wrap a fast language” dataframe tools do poorly.
When it feels like you’re writing some external UDF that’s executed in another environment, it does not feel as nice as throwing in a lambda, even if the lambda is not ideal.
- vegabook 25 days ago
  
  you have map_elements in polars which does exactly this.
  https://docs.pola.rs/api/python/dev/reference/expressions/ap...
  You can also iter_rows into a lambda if you really want to.
  https://docs.pola.rs/api/python/stable/reference/dataframe/a...
  Personally I find it extremely rare that I need to do this given Polars expressions are so comprehensive, including when.then.otherwise when all else fails.
  

rich_sasha 25 days ago

I almost fully agree. I would add that Pandas API is poorly thought through and full of footguns.

Where I certainly disagree is the "frame as a dict of time series" setting, and general time series analysis.

The feel is also different. Pandas is an interactive data analysis container, poorly suited for production use. Polars I feel is the other way round.

thelastbender12 25 days ago
I think that's a fair opinion, but I'd argue against it being poorly thought out - pandas HAS to stick with older api decisions (dating back to before data science was a mature enough field, and it has pandas to thank for much of it) for backwards compatibility.
- ohyoutravel 25 days ago
  
  Well this is like saying Python must maintain backwards compatibility with Python 2 primitives for all time. It’s simply not true. It’s not easy to deprecate an old API, but it’s doable and there are playbooks for it. Pandas is good, I’ve used it extensively, but agree it’s not fit for production use. They could catch up to the state of the art, but that requires them being very opinionated and willing to make some unpopular decisions for the greater good.
  
- ptman 25 days ago
  
  3.0 is the perfect place to break compat
sirfz 25 days ago

I think that's a sane take. Indeed, I think most data analysts find it much easier to use pandas over polars when playing with data (mainly the bracket syntax is faster and mostly sensible)

lairv 25 days ago

I would agree if not for the fact that polars is not compatible with Python multiprocessing when using the default fork method. The following script hangs forever (the pandas equivalent runs):

    import polars as pl
    from concurrent.futures import ProcessPoolExecutor

    pl.DataFrame({"a": [1,2,3], "b": [4,5,6]}).write_parquet("test.parquet")

    def read_parquet():
        x = pl.read_parquet("test.parquet")
        print(x.shape)

    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(read_parquet) for _ in range(100)]
        r = [f.result() for f in futures]

Using thread pool or "spawn" start method works but it makes polars a pain to use inside e.g. PyTorch dataloader
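The "spawn" workaround mentioned above can be sketched with only the standard library. `make_executor` is a hypothetical helper name; `mp_context` is the standard `ProcessPoolExecutor` parameter for choosing how workers are started. The demo submits `math.sqrt` as a placeholder; in the script above you would submit the module-level `read_parquet` function instead.

```python
import math
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def make_executor(start_method: str = "spawn", max_workers: int = 2) -> ProcessPoolExecutor:
    """Build a pool whose workers are started with an explicit start method."""
    ctx = mp.get_context(start_method)
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


# The guard matters with "spawn": workers re-import the main module,
# and unguarded pool creation would run again in every worker.
if __name__ == "__main__":
    with make_executor("spawn") as executor:
        print(list(executor.map(math.sqrt, [1.0, 4.0, 9.0])))
```

Spawned workers start from a fresh interpreter, so they do not inherit locks held by threads in the parent, at the cost of slower startup and no copy-on-write sharing.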

skylurk 25 days ago

You are not wrong, but for this example you can do something like this to run in threads:

  import polars as pl
  
  pl.DataFrame({"a": [1, 2, 3]}).write_parquet("test.parquet")
  
  
  def print_shape(df: pl.DataFrame) -> pl.DataFrame:
      print(df.shape)
      return df
  
  
  lazy_frames = [
      pl.scan_parquet("test.parquet")
      .map_batches(print_shape)
      for _ in range(100)
  ]
  pl.collect_all(lazy_frames, comm_subplan_elim=False)

(comm_subplan_elim is important)

ritchie46 25 days ago
Python 3.14 no longer forks by default on Linux (the default there is now "forkserver"; macOS and Windows already used "spawn").
However, this is not a Polars issue. Using "fork" can leave ANY MUTEX in the system process invalid (a multi-threaded query engine has plenty of mutexes). It is highly unsafe and has the assumption that none of your libraries in your process hold a lock at that time. That's not an assumption for PyTorch dataloaders to make.
- lairv 25 days ago
  
  Default to "spawn" is definitely the right thing, it avoids many footguns
  That said, for PyTorch DataLoader specifically, switching from fork to spawn removes copy-on-write, which can significantly increase startup time and, more importantly, memory usage. It often requires non-trivial refactors; many training codebases aren't designed for this and will simply OOM. So in practice for this use case, I've found it more practical to just use pandas rather than doing a full refactor
schmidtleonard 25 days ago
I can't believe parallel processing is still this big of a dumpster fire in python 20 years after multi-core became the rule rather than the exception.
Do they really still not have a good mechanism to toss a flag on a for loop to capture embarrassing parallelism easily?
- ritchie46 25 days ago
  
  Polars does that for you.
- skylurk 25 days ago
  
  This is one of the reasons I use polars.
- lairv 25 days ago
  
  Well I think ProcessPoolExecutor/ThreadPoolExecutor from concurrent.futures were supposed to be that
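For reference, the closest stdlib equivalent of "a flag on a for loop" looks roughly like this. It is a sketch with a toy `square` function; with threads, pure-Python CPU-bound work is generally still limited by the GIL, so this mostly helps with I/O-bound tasks or functions that release the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    # Stand-in for per-item work such as reading or parsing a file.
    return x * x


# Sequential version: squares = [square(x) for x in range(10)]
# Parallel version: same call shape, results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, range(10)))

print(squares)
```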

torcete 25 days ago

I didn't know about polars, and I can see that they also have a library for R. However, in R, it faces fiercer competition. I wonder how it compares to tidyverse, which is the established data analysis library.

datsci_est_2015 25 days ago

Might be cool once PySpark integrates with Polars, but for now like many others I’m stuck with dropping into pandas for non-vectorized operations

jvican 25 days ago
Is there any plan for this?
- devin-petersohn 25 days ago
  
  Funny enough, I actually just (2 weeks ago) added support for streaming from Pyspark to Polars/DuckDB/etc through Arrow PyCapsule. By streaming, I mean actually streaming, not collecting all data at once. It won't be released probably until May/June but it's there: https://github.com/apache/spark/commit/ecf179c3485ba8bac72af...
- datsci_est_2015 25 days ago
  
  Not that I’m aware of. The Spark ecosystem seems a little too “stable” to be putting effort into that kind of development.
  Edit: hah, based on the sibling comment, I stand corrected

bovermyer 25 days ago

As someone who just encountered Pandas for the first time as part of an Intro to Data Visualization course a few weeks ago, I am now very curious about Polars.

The professor doesn't actually care which tool we use as long as we produce nice graphs, so this is as good a time as any to experiment.

__mharrison__ 25 days ago

"every way" is strong words.

Pandas is better for plotting and third party integration.

vaylian 25 days ago

> The design of Pandas is inferior in every way to Polars

I used Pandas a lot with Jupyter notebooks. I don't have any experience with Polars. Is it also possible to work with Polars dataframes in Jupyter notebooks?

disgruntledphd2 25 days ago

Yes. Most things just work with Polars. The one issue for me is the need for geopandas.

bhadass 25 days ago

why not just go full bore to duckdb?

data-ottawa 25 days ago

A dataframe API allows you to write code in Python, with native syntax highlighting and your LSP can complete it, in one analysis file. Inlined SQL is not as nice, and has weird ergonomics.
UDFs in most dataframe libraries tend to feel better than writing udfs for a sql engine as well.
Polars specifically has lazy mode which enables a query optimizer, so you get predicate push down and all the goodies of SQL, with extra control/primitives (sane pivoting, group_by_dynamic, etc)
I do use ibis on top of duckdb sometimes, but the UDF situation persists and the way they organize their docs is very difficult to use.
vegabook 25 days ago
because method chaining in Polars is much more composable and ergonomic than SQL once the pipeline gets complex which makes it superior in an exploratory "data wrangling" environment.
- data-ottawa 25 days ago
  
  Duckdb does support pipe operators as an extension, which is a welcome addition to sql engines for me.
  But I do agree with you.

bikelang 25 days ago

All of this is true and I agree with you - but this comment comes off a bit disrespectful.

pelasaco 25 days ago

are many of the mentioned issues not just some vibe-code sessions away from done?

noitpmeder 25 days ago
Give it a shot and report back when you get them merged
- pelasaco 24 days ago
  
  not my circus not my monkeys

noo_u 25 days ago

Polars took a lot of ideas from Pandas and made them better - calling it "inferior in every way" is all sorts of disrespectful :P

Unfortunately, there are a lot of third party libraries that work with Pandas that do not work with Polars, so the switch, even for new projects, should be done with that in mind.

skylurk 25 days ago
Luckily, polars has .to_pandas() so you can still pass pandas dataframes to the libraries that really are still stuck on that interface.
I maintain one of those libraries and everything is polars internally.
- adolph 25 days ago
  
  > pandas dataframes
  Didn't Pandas move to Arrow, matching Polars, in version 2?
- noo_u 25 days ago
  
  to_pandas has a dependency on pandas - it is not the biggest of deals, but worth keeping in mind.